As motivation for this course, we’ll go back to 2002 and try to build a baseball team with a limited budget. Note that in 2002, the Yankees payroll was almost $130 million, and had more than tripled the Oakland A’s $40 million budget. [][Statistics have been used in baseball since its beginnings]. Note that the data set we will be using, included in the Lahman Library, goes back to the 19th century. For example, a summary of statistics we will describe soon, the batting average, has been used to summarize a batter’s success for decades. Other statistics such as home runs, runs batted in, and stolen bases, we’ll describe all this soon, are reported for each player in the game summaries included in the sports section of newspapers.
And players are rewarded for high numbers. Although summary statistics were widely used in baseball, data analysis per se was not. These statistics were arbitrarily decided on without much thought as to whether they actually predicted, or were related to helping a team win. This all changed with Bill James. In the late 1970s, this aspiring writer and baseball fan started publishing articles describing more in-depth analysis of baseball data. He named the approach of using data to predict what outcomes best predict if a team wins [][sabermetrics]. Until Billy Beane made sabermetrics the center of his baseball operations, Bill James’ work was mostly ignored by the baseball world.
Today, pretty much every team uses the approach, and it has gone beyond baseball into other sports. In this course, to simplify the example we use, we’ll focus on predicting scoring runs. We will ignore pitching and fielding, although those are important as well. We will see how regression analysis can help develop strategies to build a competitive baseball team with a constrained budget. [][The approach can be divided into two separate data analyses. In the first, we determine which recorded player specific statistics predict runs. In the second, we examine if players were undervalued based on what our first analysis predicts.]
[][Textbook link]
The corresponding section of the textbook is the case study on Moneyball. https://rafalab.github.io/dsbook/linear-models.html#case-study-moneyball
[][Key point]
Bill James was the originator of sabermetrics, the approach of using data to predict what outcomes best predicted if a team would win.
image here
image here
We actually don’t need to understand all the details about the game of baseball, which has over 100 rules, to see how regression will help us find undervalued players. Here, we distill the sport to the basic knowledge one needs to know to effectively attack the data science challenge. Let’s get started. [][The goal of a baseball game is to score more runs, they’re like points, than the other team]. Each team has nine batters that bat in a predetermined order. After the ninth batter hits, we start with the first again. Each time they come to bat, we call it a plate appearance, PA. At each plate appearance, the other team’s pitcher throws the ball and you try to hit it. The plate appearance ends with a binary outcome–you either make an out, that’s a failure and sit back down, or you don’t, that’s a success and you get to run around the bases and potentially score a run. Each team gets nine tries, referred to as innings, to score runs. Each inning ends after three outs, after you’ve failed three times.
From these examples, we see how luck is involved in the process. When you bat you want to hit the ball hard. If you hit it hard enough, it’s a home run, the best possible outcome as you get at least one automatic run. But sometimes, due to chance, you hit the ball very hard and a defender catches it, which makes it an out, a failure. In contrast, sometimes you hit the ball softly but it lands just in the right place. You get a hit which is a success. The fact that there is chance involved hints at why probability models will be involved in all this. Now there are [][several ways to succeed]. Understanding this distinction will be important for our analysis.
When you hit the ball you want to pass as many bases as possible. There are four bases with the fourth one called home plate. Home plate is where you start, where you try to hit. So the bases form a cycle. [][If you get home, you score a run]. We’re simplifying a bit. But there are five ways you can succeed. In other words, not making an out. First one is called a base on balls. This is when the pitcher does not pitch well and you get to go to first base. A single is when you hit the ball and you get to first base. A double is when you hit the ball and you go past first base to second. Triple is when you do that but get to third. And a home run is when you hit the ball and go all the way home and score a run. If you get to a base, you still have a chance of getting home and scoring a run if the next batter hits successfully. While you are on base, you can also try to [][steal a base]. If you run fast enough, you can try to go from first to second or from second to third without the other team tagging you.
All right. Now historically, the batting average has been considered the most important offensive statistic. To define this average, we define a hit and an at bat. Singles, doubles, triples, and home runs are hits. But remember, there’s a fifth way to be successful, the base on balls. That is not a hit. An at bat is the number of times you either get a hit or make an out, bases on balls are excluded. The batting average is simply hits divided by at bats. And it is considered the main measure of a success rate. Today, in today’s game, this success rates ranges from player to player from about 20% to 38%. We refer to the batting average in thousands. So for example, if your success rate is 25% we say you’re batting 250.
One of Bill James’ first important insights is that the [][batting average ignores bases on balls but bases on balls is a success]. So a player that gets many more bases on balls than the average player might not be recognized if he does not excel in batting average. But is this player not helping produce runs? No award is given to the player with the most bases on balls. In contrast, the total number of stolen bases are considered important and an award is given out to the player with the most. But players with high totals of stolen bases also make outs as they do not always succeed.
So does a player with a high stolen base total help produce runs? Can
we use data size to determine if it’s better to pay for bases on balls
or stolen bases? [][One of the challenges in this analysis is that it is
not obvious how to determine if a player produces runs because so much
depends on his teammates]. We do keep track of the number of runs scored
by our player. But note that if you hit after someone who hits
many home runs, you will score many runs
(Super batter hit the ball far away thus you can run many bases as well, lucky player).
But these runs don’t necessarily happen if we hire this player but not
his home run hitting teammate. [][However, we can examine team level
statistics] (How ???). How do teams with many stolen bases
compare to teams with few? How about bases on balls? We have data. Let’s
examine some.
[][Textbook link]
This video corresponds to the textbook section on baseball basics. https://rafalab.github.io/dsbook/linear-models.html#baseball-basics
[][Key points]
The goal of a baseball game is to score more runs (points) than the other team.
Each team has 9 batters who have an opportunity to hit a ball with a bat in a predetermined order.
Each time a batter has an opportunity to bat, we call it a plate appearance (PA).
The PA ends with a binary outcome: the batter either makes an out (failure) and returns to the bench or the batter doesn’t (success) and can run around the bases, and potentially score a run (reach all 4 bases).
We are simplifying a bit, but there are five ways a batter can succeed (not make an out):
Base on balls (BB): the pitcher fails to throw the ball through a predefined area considered to be hittable (the strike zone), so the batter is permitted to go to first base.
Single: the batter hits the ball and gets to first base.
Double (2B): the batter hits the ball and gets to second base.
Triple (3B): the batter hits the ball and gets to third base.
Home Run (HR): the batter hits the ball and goes all the way home and scores a run.
Historically, the batting average has been considered the most important offensive statistic. To define this average, we define a hit (H) and an at bat (AB). Singles, doubles, triples, and home runs are hits. The fifth way to be successful, a walk (BB), is not a hit. An AB is the number of times you either get a hit or make an out; BBs are excluded. The batting average is simply H/AB and is considered the main measure of a success rate.
Note: The video states that if you hit AFTER someone who hits many home runs, you will score many runs, while the textbook states that if you hit BEFORE someone who hits many home runs, you will score many runs. The textbook wording is accurate.
image here
image here
plate appearance
image here
In baseball, a home run (abbreviated HR) is scored when the ball is hit in such a way that the batter is able to circle the bases and reach home safely
image here
image here
Base on ball
A single is you hit the ball and get to first base
image here
image here
baseball home run, go all the way home and score a run
baseball steal a base
Image here
batting average equation
image here
Let’s start looking at some baseball data and try to answer your questions using these data. First one, do teams that hit more home runs score more runs? We know what the answer to this will be, but let’s look at the data anyways. We’re going to examine data from 1961 to 2001. We end at 2001 because, remember, we’re back in 2002, getting ready to build a team.
We started in 1961, because that year, the league changed from 154 games to 162 games. The visualization of choice when exploring the relationship between two variables like home runs and runs is a scatterplot. The following code shows you how to make that scatterplot. We start by loading the Lahman library that has all these baseball statistics. And then we simply make a scatterplot using 2d plot. Here’s a plot of runs per game versus home runs per game.
The plot shows a very strong association–teams with more home runs tended to score more runs. Now, let’s examine the relationship between stolen bases and wins. Here are the runs per game plotted against stolen bases per game. Here, the relationship is not as clear. Finally, let’s examine the relationship between bases on balls and runs. Here are runs per game versus bases on balls per game. Although the relationship is not as strong as it was for home runs, we do see a pretty strong relationship here.
We know that, by definition, home runs cause runs, because when you hit a home run, at least one run will score. Now it could be that home runs also cause the bases on balls. If you understand the game, you will agree with me that that could be the case. [][So it might appear that a base on ball is causing runs, when in fact, it’s home runs that’s causing both]. This is called [][confounding]. An important concept you will learn about. Linear regression will help us parse all this out and quantify the associations. This will then help us determine what players to recruit. Specifically, we will try to predict things like how many more runs will the team score if we increase the number of bases on balls but keep the home runs fixed. Regression will help us answer this question, as well.
[][Textbook link]
This video corresponds to the base on balls or stolen bases textbook section. https://rafalab.github.io/dsbook/linear-models.html#base-on-balls-or-stolen-bases
[][Key points]
The visualization of choice when exploring the relationship between two variables like home runs and runs is a scatterplot.
Code: Scatterplot of the relationship between HRs and wins
library(Lahman)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dslabs)
ds_theme_set()
Teams %>%
filter(yearID %in% 1961:2001) %>%
mutate(HR_per_game = HR / G, R_per_game = R / G) %>%
ggplot(aes(HR_per_game, R_per_game)) +
geom_point(alpha = 0.5)
# Code: Scatterplot of the relationship between stolen bases and wins
Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(SB_per_game = SB / G, R_per_game = R / G) %>%
ggplot(aes(SB_per_game, R_per_game)) +
geom_point(alpha = 0.5)
# Code: Scatterplot of the relationship between bases on balls and runs
Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(BB_per_game = BB / G, R_per_game = R / G) %>%
ggplot(aes(BB_per_game, R_per_game)) +
geom_point(alpha = 0.5)
Image here
library(Lahman)
library(dplyr)
library(ggplot2)
#ggplot2::ds_theme_set()
Teams %>%
filter(yearID %in% 1961:2001) %>%
mutate(HR_per_game = HR/G, R_per_game = R/G) %>%
ggplot2::ggplot(aes(HR_per_game, R_per_game)) +
geom_point(alpha=0.5)
Image here
Image here
image here
caused the both
image here
library(Lahman)
library(tidyverse) # this one includes dplyr and ggplot2 and many others
library(dslabs)
ds_theme_set()
Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR_per_game = HR / G, R_per_game = R / G) %>%
ggplot(aes(HR_per_game, R_per_game)) +
geom_point(alpha = 0.5)
# Code: Scatterplot of the relationship between stolen bases and wins
Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(SB_per_game = SB / G, R_per_game = R / G) %>%
ggplot(aes(SB_per_game, R_per_game)) +
geom_point(alpha = 0.5)
# Code: Scatterplot of the relationship between bases on balls and runs
Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(BB_per_game = BB / G, R_per_game = R / G) %>%
ggplot(aes(BB_per_game, R_per_game)) +
geom_point(alpha = 0.5)
Comprehension Check due May 29, 2022 00:29 AWST Completed
1/1 point (graded) What is the application of statistics and data science to baseball called? Moneyball Sabermetrics The “Oakland A’s Approach” There is no specific name for this; it’s just data science.
1/1 point (graded) Which of the following outcomes is not included in the batting average? A home run A base on balls An out A single
1/1 point (graded) Why do we consider team statistics as well as individual player statistics? The success of any individual player also depends on the strength of their team. Team statistics can be easier to calculate. The ultimate goal of sabermetrics is to rank teams, not players.
1.0/1.0 point (graded) You want to know whether teams with more at-bats per game have more runs per game. What R code below correctly makes a scatter plot for this relationship?
Teams %>% filter(yearID %in% 1961:2001 ) %>% ggplot(aes(AB, R)) + geom_point(alpha = 0.5)
Teams %>% filter(yearID %in% 1961:2001 ) %>% mutate(AB_per_game = AB/G, R_per_game = R/G) %>% ggplot(aes(AB_per_game, R_per_game)) + geom_point(alpha = 0.5)
Teams %>% filter(yearID %in% 1961:2001 ) %>% mutate(AB_per_game = AB/G, R_per_game = R/G) %>% ggplot(aes(AB_per_game, R_per_game)) + geom_line()
Teams %>% filter(yearID %in% 1961:2001 ) %>% mutate(AB_per_game = AB/G, R_per_game = R/G) %>% ggplot(aes(R_per_game, AB_per_game)) + geom_point()
1.0/1.0 point (graded) What does the variable “SOA” stand for in the Teams table?
Hint: make sure to use the help file (?Teams). sacrifice out slides or attempts strikeouts by pitchers accumulated singles
1/1 point (graded)
Load the Lahman library. Filter the Teams data frame to include years from 1961 to 2001. Make a scatterplot of runs per game versus at bats (AB) per game. Which of the following is true? There is no clear relationship between runs and at bats per game. As the number of at bats per game increases, the number of runs per game tends to increase. As the number of at bats per game increases, the number of runs per game tends to decrease.
Teams %>%
filter(yearID %in% 1961:2001 ) %>%
mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
ggplot(aes(AB_per_game, R_per_game)) +
geom_point(alpha = 0.5)
0/1 point (graded)
Use the filtered Teams data frame from Question 6. Make a scatterplot of win rate (number of wins per game) versus number of fielding errors (E) per game. Which of the following is true? There is no relationship between win rate and errors per game. As the number of errors per game increases, the win rate tends to increase. As the number of errors per game increases, the win rate tends to decrease.This is the answer
library(dplyr) # this is for pipe %>%
library(ggplot2)
library(Lahman) # this is for Teams
#library(tidyverse) # this one includes dplyr, ggplot2 and many others
Teams %>% filter(yearID %in% 1961:2001 ) %>%
mutate(number_of_wins_per_game = W/G, fielding_errors_per_game = E/G) %>%
ggplot(aes(number_of_wins_per_game, fielding_errors_per_game)) +
geom_point(alpha = 0.3)
# Explanation
When you examine the scatterplot, you can see a clear trend towards
decreased win rate with increasing number of errors per game
(before I wa using big scatter markersize). The following
code can be used to make the scatterplot:
Teams %>%
filter(yearID %in% 1961:2001) %>%
mutate(win_rate = W / G, E_per_game = E / G) %>%
ggplot(aes(win_rate, E_per_game)) +
geom_point(alpha = 0.5)
Teams %>%
summarise(cor(W/G, E/G))
## cor(W/G, E/G)
## 1 -0.2158873
1/1 point (graded)
Use the filtered Teams data frame from Question 6. Make a scatterplot of triples (X3B) per game versus doubles (X2B) per game. Which of the following is true? There is no clear relationship between doubles per game and triples per game. As the number of doubles per game increases, the number of triples per game tends to increase. As the number of doubles per game increases, the number of triples per game tends to decrease.
Teams %>%
filter(yearID %in% 1961:2001 ) %>%
mutate(triple_per_game = X3B/G, double_per_game = X2B/G) %>%
ggplot(aes(triple_per_game, double_per_game)) +
geom_point(alpha = 0.5)
Ask your questions or make your comments about Baseball as a Motivating Example here! Remember, one of the best ways to reinforce your own learning is by explaining something to someone else, so we encourage you to answer each other’s questions (without giving away the answers, of course).
Some reminders:
Search the discussion board before posting to see if someone else has asked the same thing before asking a new question Please be specific in the title and body of your post regarding which question you’re asking about to facilitate answering your question. Posting snippets of code is okay, but posting full code solutions is not. If you do post snippets of code, please format it as code for readability. If you’re not sure how to do this, there are instructions in a pinned post in the “general” discussion forum.
library(HistData)
library(dplyr)
data("GaltonFamilies")
galton_heights <- GaltonFamilies %>%
filter(childNum == 1 & gender == 'male') %>%
select(father, childHeight) %>%
rename(son = childHeight)
galton_heights %>%
summarise(mean(father), sd(father), mean(son), sd(son))
## mean(father) sd(father) mean(son) sd(son)
## 1 69.09888 2.546555 70.45475 2.557061
# mean(father) sd(father) mean(son) sd(son)
#1 69.09888 2.546555 70.45475 2.557061
galton_heights %>%
ggplot(aes(father, son)) +
geom_point(alpha=0.5)
Up to now in this series, we have focused mainly on univariate
variables. However, in data science application it is very
common to be interested in the relationship between two or more
variables.
(Google this topic and explore a case study) We saw this in
our baseball example in which we were interested in the relationship,
for example, between bases on balls and runs. we’ll come back to this
example, but we introduce the concepts of correlation and regression
using a simpler example.
We’ll create a data set with the heights of fathers and the first sons. The actual data Galton used to discover and define regression. So we have the father and son height data. Suppose we were to summarize these data. Since both distributions are well approximated by normal distributions, we can use the two averages and two standard deviations as summaries. Here they are.
[][However, this summary fails to describe a very important characteristic of the data that you can see in this figure. The trend that the taller the father, the taller the son, is not described by the summary statistics of the average and the standard deviation. We will learn that the correlation coefficient is a summary of this trend].
[][Textbook link]
The corresponding textbook section is Case Study: is height hereditary? https://rafalab.github.io/dsbook/regression.html#case-study-is-height-hereditary
[][Key points]
Galton tried to predict sons' heights based on fathers' heights.
The mean and standard errors are insufficient for describing an important characteristic of the data: the trend that the taller the father, the taller the son.
The correlation coefficient is an informative summary of how two variables move together that can be used to predict one variable using the other.
Code
# create the dataset
library(tidyverse)
library(HistData)
data("GaltonFamilies")
set.seed(1983)
# ##################################################################################################################################
galton_heights <- GaltonFamilies %>%
filter(gender == "male") %>%
group_by(family) %>%
sample_n(1) %>%
ungroup() %>%
select(father, childHeight) %>%
rename(son = childHeight)
# means and standard deviations
galton_heights %>%
summarize(mean(father), sd(father), mean(son), sd(son))
## # A tibble: 1 × 4
## `mean(father)` `sd(father)` `mean(son)` `sd(son)`
## <dbl> <dbl> <dbl> <dbl>
## 1 69.1 2.55 69.2 2.71
# scatterplot of father and son heights
galton_heights %>%
ggplot(aes(father, son)) +
geom_point(alpha = 0.5)
library(dplyr)
library(Lahman)
library(HistData)
data("GaltonFamilies")
galton_heights <- GaltonFamilies %>%
filter(childNum == 1 & gender == 'male') %>%
select(father, childHeight) %>%
rename(son=childHeight)
galton_heights %>%
summarise(mean(father), sd(father), mean(son), sd(son))
## mean(father) sd(father) mean(son) sd(son)
## 1 69.09888 2.546555 70.45475 2.557061
galton_heights %>%
ggplot(aes(father, son)) +
geom_point(alpha=0.5)
The correlation coefficient is defined for a list of pairs–x1, y1 through xn, yn–with the following formula. Here, mu_x and mu_y are the averages of x and y, respectively. And sigma_x and sigma_y are the standard deviations. The Greek letter rho is commonly used in the statistics book, to denote this correlation. The reason is that rho is the Greek letter for r, the first letter of the word regression. Soon, we will learn about the connection between correlation and regression. [][To understand why this equation does, in fact, summarize how two variables move together], consider the i-th entry of x is xi minus mu_x divided by sigma_x SDs away from the average. Similarly, the yi–which is paired with the xi–is yi minus mu_y divided by sigma_y SDs away from the average y.
If x and y are unrelated, then the product of these two quantities will be positive. That happens when they are both positive or when they are both negative as often as they will be negative. That happens when one is positive and the other is negative, or the other way around. One is negative and the other one is positive. [][This will average to about 0. The correlation is this average.]
And therefore, unrelated variables will have a correlation of about 0. If instead the quantities vary together, then we are averaging mostly positive products. Because they’re going to be either positive times positive or negative times negative. And we get a positive correlation. If they vary in opposite directions, we get a negative correlation.
Another thing to know is that we can show mathematically that the correlation is always between negative 1 and 1. To see this, consider that we can have higher correlation than when we compare a list to itself. That would be perfect correlation. In this case, the correlation is given by this equation, which we can show is equal to 1. A similar argument with x and its exact opposite, negative x, proves that the correlation has to be greater or equal to negative 1. So it’s between minus 1 and 1.
To see what data looks like for other values of rho, here are six examples of pairs with correlations ranging from negative 0.9 to 0.99. When the correlation is negative, we see that they go in opposite direction. As x increases, y decreases. When the correlation gets either closer to 1 or negative 1, we see the clot of points getting thinner and thinner. When the correlation is 0, we just see a big circle of points.
[][Textbook link]
This video corresponds to the correlation coefficient section of the textbook. https://rafalab.github.io/dsbook/regression.html#the-correlation-coefficient
[][Key points]
[][* The correlation coefficient (rho) is defined for a list of pairs *]
(x_1, y_1), ..., (x_n, y_n)
as the product of the standardized values:
((x_i - mu_x)/Sigma_x) * ((y_i - mu_y)/Sigma_y)
[][* The correlation coefficient essentially conveys how two variables move together.*] [][* The correlation coefficient is always between -1 and 1.*]
Code
rho <- mean(scale(x)*scale(y)) galton_heights %>% summarize(r = cor(father, son)) %>% pull(r)
alt text here.
alt text here.
alt text here.
alt text here.
alt text here.
alt text here.
alt text here.
alt text here.
library(Lahman)
library(dplyr)
library(HistData)
data("GaltonFamilies")
galton_heights <- GaltonFamilies %>%
filter(childNum == 1 & gender == "male") %>%
select(father, childHeight) %>%
rename(son = childHeight)
galton_heights %>%
summarise(cor(father, son))
## cor(father, son)
## 1 0.5007248
alt text here.
Before we continue describing regression, let’s go over a reminder about random variability. In most data science applications, we do not observe the population, but rather a sample. As with the average and standard deviation, the sample correlation is the most commonly used estimate of the population correlation. This implies that the correlation we compute and use as a summary is a random variable. As an illustration, let’s assume that the 179 pairs of fathers and sons is our entire population. A less fortunate geneticist can only afford to take a random sample of 25 pairs. The sample correlation for this random sample can be computed using this code. Here, the variable R is the random variable.
We can run a monte-carlo simulation to see the distribution of this
random variable. Here, we recreate R 1000 times, and plot its histogram.
We see that the expected value is the population
correlation, the mean of these Rs is 0.5, and that it has a
relatively high standard error relative to its size, SD 0.147. This is
something to keep in mind when interpreting correlations. [][It
is a random variable, and it can have a pretty large standard
error]. Also note that because the sample correlation
is an average of independent draws
(independent? average? how?), the Central Limit Theorem
actually applies. [][Therefore, for a large enough sample size N,
the distribution of these Rs is approximately normal].
The expected value we know is the population correlation. The
standard deviation is somewhat more complex to derive, but this is the
actual formula here. In our example, N equals to 25, does not appear to
be large enough to make the approximation a good one
(how to identify a good one ??? Should the standard deviation equal to a normal distribution or something ???),
as we see in this QQ-plot.
[][Textbook link]
This video corresponds to the textbook section titled: Sample correlation is a random variable. https://rafalab.github.io/dsbook/regression.html#sample-correlation-is-a-random-variable
[][Key points]
The correlation that we compute and use as a summary is a random variable.
When interpreting correlations, it is important to remember that correlations derived from samples are estimates containing uncertainty.
Because the sample correlation is an average of independent draws, the central limit theorem applies.
Code
library(dplyr)
library(Lahman)
library(HistData)
data("GaltonFamilies")
galton_heights <- GaltonFamilies %>%
filter(childNum == 1 & gender == 'male') %>%
select(father, childHeight) %>%
rename(son=childHeight)
# compute sample correlation
R <- sample_n(galton_heights, 25, replace = TRUE) %>%
summarize(r = cor(father, son))
R
## r
## 1 0.4787613
R$r
## [1] 0.4787613
# Monte Carlo simulation to show distribution of sample correlation
B <- 1000
N <- 25
R <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) %>%
summarize(r = cor(father, son)) %>%
pull(r)
})
qplot(R, geom = "histogram", binwidth = 0.05, color = I("black"))
# expected value and standard error
mean(R)
## [1] 0.4970997
sd(R)
## [1] 0.1512451
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# QQ-plot to evaluate whether N is large enough
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
data.frame(R) %>%
ggplot(aes(sample = R)) +
stat_qq() +
geom_abline(intercept = mean(R), slope = sqrt((1-mean(R)^2)/(N-2)))
library(dplyr)
library(ggplot2)
library(Lahman)
library(HistData)
data("GaltonFamilies")
galton_heights <- GaltonFamilies %>%
filter(childNum == 1 & gender == 'male') %>%
select(father, childHeight) %>%
rename(son=childHeight)
# compute sample correlation
R <- sample_n(galton_heights, 25, replace = TRUE) %>%
summarize(r = cor(father, son))
R
## r
## 1 0.685367
R$r
## [1] 0.685367
# Monte Carlo simulation to show distribution of sample correlation
B <- 1000
N <- 25
R <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) %>%
summarize(r = cor(father, son)) %>%
pull(r)
})
qplot(R, geom = "histogram", bins = 15, color = I("black"))
# expected value and standard error
mean(R)
## [1] 0.4974826
sd(R)
## [1] 0.1461927
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# QQ-plot to evaluate whether N is large enough
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
data.frame(R) %>%
ggplot(aes(sample = R)) +
stat_qq() +
geom_abline(intercept = mean(R), slope = sqrt((1-mean(R)^2)/(N-2)))
library(Lahman)
library(dplyr)
library(HistData)
library(stats)
data("GaltonFamilies")
galton_heights <- GaltonFamilies %>%
filter(childNum == 1 & gender == "male") %>%
select(father, childHeight) %>%
rename(son = childHeight)
set.seed(0)
R <- sample_n(galton_heights, 25, replace=TRUE) %>%
summarize(cor(father, son))
R
## cor(father, son)
## 1 0.5889351
library(dplyr)
library(ggplot2)
B <- 1000
N <- 25
R <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) %>%
summarize(r = cor(father, son)) %>% .$r
})
# ========================================================================================================
# Using $ Operator to Access Data Frame Column.
# Using . to do what
data.frame(R) %>%
ggplot(aes(R)) + geom_histogram(binwidth=0.05, color='black')
mean(R)
## [1] 0.5040874
sd(R)
## [1] 0.1439084
library(Lahman)
library(dplyr)
library(HistData)
library(stats)
data("GaltonFamilies")
galton_heights <- GaltonFamilies %>%
filter(childNum == 1 & gender == "male") %>%
select(father, childHeight) %>%
rename(son = childHeight)
set.seed(0)
R <- sample_n(galton_heights, 25, replace=TRUE) %>%
summarize(mean(father), sd(father), mean(son), sd(son))
R
## mean(father) sd(father) mean(son) sd(son)
## 1 69.232 2.302303 70.632 2.107273
this implies that the correlation we compte and used as a summary is a random variable.png
because the sample correlation is an average of independent draws, the centeral limit theorem applies.png
[][Read this and think]
alt text here.
alt text here
1/1 point (graded) While studying heredity, Francis Galton developed what important statistical concept? Standard deviation Normal distribution Correlation Probability
1/1 point (graded) The correlation coefficient is a summary of what? The trend between two variables The dispersion of a variable The central tendency of a variable The distribution of a variable correct
1/1 point (graded)
Below is a scatter plot showing the relationship between two variables, x and y. Scatter plot of relationship between x (plotted on the x-axis) and y (plotted on the y-axis). y-axis values range from -3 to 3; x-axis values range from -3 to 3. Points are fairly well distributed in a tight band with a range from approximately (-2, 2) to (3, -3).
From this figure, the correlation between
x and y appears to be about: -0.9 -0.2 0.9 2
1/1 point (graded)
Instead of running a Monte Carlo simulation with a sample size of 25 from the 179 father-son pairs described in the videos, we now run our simulation with a sample size of 50. Would you expect the mean of our sample correlation to increase, decrease, or stay approximately the same? Increase Decrease Stay approximately the same
1/1 point (graded)
Instead of running a Monte Carlo simulation with a sample size of 25 from the 179 father-son pairs described in the videos, we now run our simulation with a sample size of 50. Would you expect the standard deviation of our sample correlation to increase, decrease, or stay approximately the same? Increase Decrease Stay approximately the same
1/1 point (graded) If X and Y are completely independent, what do you expect the value of the correlation coefficient to be? -1 -0.5 0 0.5 1 Not enough information to answer the question
1/1 point (graded)
Load the Lahman library. Filter the Teams data frame to include years from 1961 to 2001. What is the correlation coefficient between number of runs per game and number of at bats per game? correct 0.6580976
Loading You have used 1 of 10 attempts Some
1/1 point (graded)
Use the filtered Teams data frame from Question 7. What is the correlation coefficient between win rate (number of wins per game) and number of errors per game? correct -0.3396947
Loading You have used 1 of 10 attempts Some
1/1 point (graded)
Use the filtered Teams data frame from Question 7. What is the correlation coefficient between doubles (X2B) per game and triples (X3B) per game? correct -0.01157404
Loading You have used 1 of 10 attempts Some
library(Lahman)
library(tidyverse)
#library(dplyr)
#library(ggplot2)
Teams %>% filter(yearID %in% 1961:2001 ) %>%
mutate(number_of_runs_per_game = R/G, number_of_bats_per_game = AB/G) %>%
ggplot2::ggplot(aes(number_of_runs_per_game, number_of_bats_per_game)) + geom_point(alpha = 0.5)
# https://stackoverflow.com/questions/60901319/r-language-registered-s3-method-overwritten-by-data-table
library(Lahman)
library(tidyverse)
Teams %>% filter(yearID %in% 1961:2001 ) %>%
summarize(cor(R/G, AB/G))
## cor(R/G, AB/G)
## 1 0.6580976
library(Lahman)
library(tidyverse)
Teams %>% filter(yearID %in% 1961:2001 ) %>%
summarize(cor(W/G, E/G))
## cor(W/G, E/G)
## 1 -0.3396947
library(Lahman)
library(tidyverse)
Teams %>% filter(yearID %in% 1961:2001 ) %>%
summarize(cor(X2B/G, X3B/G))
## cor(X2B/G, X3B/G)
## 1 -0.01157404
Ask your questions or make your comments about Correlation here! Remember, one of the best ways to reinforce your own learning is by explaining something to someone else, so we encourage you to answer each other’s questions (without giving away the answers, of course).
Some reminders:
Search the discussion board before posting to see if someone else has asked the same thing before asking a new question
Please be specific in the title and body of your post regarding which question you're asking about to facilitate answering your question.
Posting snippets of code is okay, but posting full code solutions is not.
If you do post snippets of code, please format it as code for readability. If you're not sure how to do this, there are instructions in a pinned post in the "general" discussion forum.
Correlation is not always a good summary of the relationship between two variables. A famous example used to illustrate this are the following for artificial data sets, referred to as Anscombe’s quartet. All of these pairs have a correlation of 0.82. Correlation is only meaningful in a particular context. To help us understand when it is that correlation is meaningful as a summary statistic, we’ll try to predict the son’s height using the father’s height. This will help motivate and define linear regression. We start by demonstrating how correlation can be useful for prediction.
[][Suppose we are asked to guess the height of a randomly selected son]. Because of the distribution of the son height is approximately normal, we know that the average height of 70.5 inches is a value with the highest proportion and would be the prediction with the chances of minimizing the error. [][But what if we are told that the father is 72 inches?] Do we still guess 70.5 inches for the son? The father is taller than average, specifically he is 1.14 standard deviations taller than the average father. So shall we predict that the son is also 1.14 standard deviations taller than the average son? It turns out that this would be an overestimate.
To see this, we look at all the sons with fathers who are about 72 inches. We do this by stratifying the father’s height. We call this a conditional average, since we are computing the average son height conditioned on the father being 72 inches tall. A challenge when using this approach in practice is that we don’t have many fathers that are exactly 72. In our data set, we only have eight. If we change the number to 72.5, we would only have one father who is that height. This would result in averages with large standard errors, and they won’t be useful for prediction for this reason. But for now, what we’ll do is we’ll take an approach of creating strata of fathers with very similar heights. Specifically, we will round fathers’ heights to the nearest inch. This gives us the following prediction for the son of a father that is approximately 72 inches tall. We can use this code and get our answer, which is 71.84. This is 0.54 standard deviations larger than the average son, a smaller number than the 1.14 standard deviations taller that the father was above the average father. Stratification followed by box plots lets us see the distribution of each group. Here is that plot.
We can see that the centers of these groups are increasing with
height, not surprisingly. The means of each group appear to follow a
linear relationship
(then why we supposed to use a scatter plot to explore relationship between two variales in Exploratory Data Analysis in Python DataCamp course ? I mean many dots covering on each other, Using a boxplot would avoid this condition ???).
We can make that plot like this, with this code. See the plot and notice
that this appears to follow a line. The slope of this line appears to be
about 0.5, which happens to be the correlation between father and son
heights. This is not a coincidence. To see this
connection, let’s plot the standardized heights
(Why??? standardize? can we just use the boxplot or the grouped mean?)
against each other, son versus father, with a line that has a slope
equal to the correlation. (Think, Think, Think) [][****Read
the important comments in below standardized plot, and think about how
its all related to each other****]
Here’s the code. Here’s a plot. This line is what we call the
regression line. In a later video, we will describe Galton’s theoretical
justification for using this line to estimate conditional means. Here,
we define it and compute it for the data at hand. The regression line
for two variables, x and y, tells us that [][for every
standard deviation (sigma x) increase above the average (mu x). For x, y
grows rho standard deviations (sigma y) above the average (mu
y).]. The formula for the regression line is therefore
this one (Think, Think, Think)[][How does this comes
out???]. If there’s perfect correlation, we predict an increase that is
the same number of SDs. If there’s zero correlation, then we don’t use x
at all for the prediction of y. For values between 0 and 1, the
prediction is somewhere in between. If the correlation is negative, we
predict a reduction, instead of an increase.
It is because when the correlation is positive but lower
than the one, that we predict something closer to the mean
(it has to be the normal distribution, enough sized sample),
that we call this regression. The son regresses to the average
height.
In fact, the title of Galton’s paper was “Regression Towards Mediocrity in Hereditary Stature.” Note that if we write this in the standard form of a line, y equals b plus mx, where b is the intercept and m is the slope, the regression line has slope rho times sigma y, divided by sigma x, and intercept mu y, minus mu x, times the slope. So if we standardize the variable so they have average 0 and standard deviation 1. Then the regression line has intercept 0 and slope equal to the correlation rho. Let’s look at the original data, father son data, and add the regression line. We can compute the intercept and the slope using the formulas we just derived. Here’s a code to make the plot with the regression line. If we plot the data in standard units, then, as we discussed, the regression line as intercept 0 and slope rho. Here’s the code to make that plot.
We started this discussion by saying that we wanted to use the
conditional means to predict the heights of the sons. But then we
realized that there were very few data points in each strata. When we
did this approximation of rounding off the height of the fathers
(the boxplot), we found that these conditional means appear
to follow a line. And we ended up with the regression line
(that is if we standardize both variables - father, son).
****So the regression line gives us the prediction****. An advantage of
using the regression line is that we used all the data to estimate
just two parameters, the slope and the intercept. This makes it much
more stable. When we do conditional means, we had fewer data
points, which made the estimates have a large standard error, and
therefore be unstable. So this is going to give us a much more stable
prediction using the regression line. However, are we justified in using
the regression line to predict? Galton gives us the answer.
[][Textbook link]
There are three links to relevant sections of the textbook for this video:
correlation is not always a useful summary
https://rafalab.github.io/dsbook/regression.html#correlation-is-not-always-a-useful-summary
conditional expectation
https://rafalab.github.io/dsbook/regression.html#conditional-expectation
the regression line
https://rafalab.github.io/dsbook/regression.html#the-regression-line
[][Key points]
[][* Correlation is not always a good summary of the relationship between two variables.*] The general idea of conditional expectation is that we stratify a population into groups and compute summaries in each group. A practical way to improve the estimates of the conditional expectations is to define strata of with similar values of x. If there is perfect correlation, the regression line predicts an increase that is the same number of SDs for both variables. If there is 0 correlation, then we don’t use x at all for the prediction and simply predict the average. For values between 0 and 1, the prediction is somewhere in between. If the correlation is negative, we predict a reduction instead of an increase.
Code
# number of fathers with height 72 or 72.5 inches
sum(galton_heights$father == 72)
## [1] 8
sum(galton_heights$father == 72.5)
## [1] 1
# predicted height of a son with a 72 inch tall father
conditional_avg <- galton_heights %>%
filter(round(father) == 72) %>%
summarize(avg = mean(son)) %>%
pull(avg)
conditional_avg
## [1] 71.83571
# stratify fathers' heights to make a boxplot of son heights
galton_heights %>%
mutate(father_strata = factor(round(father))) %>%
ggplot(aes(father_strata, son)) +
geom_boxplot() +
geom_point()
# center of each boxplot
galton_heights %>%
mutate(father = round(father)) %>%
group_by(father) %>%
summarize(son_conditional_avg = mean(son)) %>%
ggplot(aes(father, son_conditional_avg)) +
geom_point()
# calculate values to plot regression line on original data
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m <- r * s_y/s_x
b <- mu_y - m*mu_x
r
## [1] 0.5007248
m
## [1] 0.5027904
b
## [1] 35.71249
# add regression line to plot
galton_heights %>%
ggplot(aes(father, son)) +
geom_point(alpha = 0.5) +
geom_abline(intercept = b, slope = m)
image here
image here
image here
image here
image here
conditional_avg <- galton_heights %>%
filter(round(father)==72) %>%
summarise(avg=mean(son)) %>%
.$avg
conditional_avg
## [1] 71.83571
galton_heights %>%
mutate(father_strata = factor(round(father))) %>%
ggplot(aes(father_strata, son)) +
geom_boxplot() +
geom_point()
galton_heights %>%
mutate(father=round(father)) %>%
group_by(father) %>%
summarise(son_conditional_avg = mean(son)) %>%
ggplot(aes(father, son_conditional_avg)) +
geom_point()
image here
r <- galton_heights %>%
summarise(r = cor(father, son)) %>%
.$r
galton_heights %>%
mutate(father = round(father)) %>%
group_by(father) %>%
summarise(son = mean(son)) %>%
mutate(z_father = scale(father), z_son = scale(son)) %>%
ggplot2::ggplot(aes(z_father, z_son)) +
geom_point() +
geom_abline(intercept = 0, slope = r)
# **Why do we use scale function in R?**
# When we want to scale the values in several columns of a data frame so that each column has a mean of 0 and a standard deviation of 1, we usually use the scale() function.
r <- galton_heights %>%
summarise(r = cor(father, son)) %>%
.$r
galton_heights %>%
mutate(father = round(father)) %>%
group_by(father) %>%
summarise(son = mean(son)) %>%
# mutate(z_father = father, z_son = son) %>%
# ggplot2::ggplot(aes(z_father, z_son)) +
ggplot(aes(father, son)) +
geom_point() +
geom_abline(intercept = 35.5, slope = r)
# So the reason for standardize is the intercept would be easy to define, otherwise we have to guess until its 35 or something, and also the two plot seems different, this one and the standardized one. Recall how correlation is calculated, and how scale() function standardizing the data: (x - mean(x)) / sd(x)
image here
a <- galton_heights %>%
mutate(father = round(father)) %>%
group_by(father) %>%
summarise(son = mean(son)) %>%
mutate(z_father = scale(father), z_son = scale(son))
# https://stackoverflow.com/questions/20256028/understanding-scale-in-r
a
## # A tibble: 15 × 4
## father son z_father[,1] z_son[,1]
## <dbl> <dbl> <dbl> <dbl>
## 1 62 65.2 -1.70 -2.14
## 2 64 68.1 -1.28 -1.00
## 3 65 67.6 -1.06 -1.21
## 4 66 69.2 -0.850 -0.566
## 5 67 70.0 -0.638 -0.262
## 6 68 69.2 -0.425 -0.572
## 7 69 71.2 -0.213 0.231
## 8 70 71.2 0 0.198
## 9 71 71.5 0.213 0.341
## 10 72 71.8 0.425 0.469
## 11 73 71.5 0.638 0.350
## 12 74 75.2 0.850 1.82
## 13 75 71.2 1.06 0.204
## 14 76 73.5 1.28 1.13
## 15 78 73.2 1.70 1.01
Read the below statement
image here
image here
image here
image here
image here
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m <- r * s_y / s_x
b <- mu_y - m * mu_x
r
## [1] 0.5007248
m
## [1] 0.5027904
b
## [1] 35.71249
galton_heights %>%
ggplot(aes(father, son)) +
geom_point(alpha=0.3) +
geom_abline(intercept = b, slope=m)
galton_heights %>%
ggplot(aes(scale(father), scale(son))) +
geom_point(alpha=0.3) +
geom_abline(intercept = 0, slope = r)
image here
Correlation and the regression line are widely used summary statistics. [][But it is often misused or misinterpreted]. Anscombe’s example provided toy example of data sets in which summarizing with a correlation would be a mistake. But we also see it in the media and in scientific literature as well.
The main way we motivate the use of correlation involve
what is called the bivariate normal distribution.
[][When a pair of random variables is approximated by a
bivariate normal distribution, the scatterplot looks like ovals, like
American footballs]. They can be thin. That’s when they
have high correlation. All the way up to a circle shape when they have
no correlation. We saw some examples previously. Here they are again. A
more technical way to define the bivariate normal distribution is the
following. First, this distribution is defined for
pairs (like the father - son paris used earlier).
So we have two variables, x and y. And they have paired values. They are
going to be bivariate normally distributed if the following happens.
[][If x is a normally distributed random variable, and y is also
a normally distributed random variable–and for any grouping of x that we
can define, say, with x being equal to some predetermined value, which
we call here in this formula little x–then the y’s in that group are
approximately normal as well.] If this happens, then the pair is
approximately bivariate normal. When we fix x in this way, we then refer
to the resulting distribution of the y’s in the group–defined by setting
x in this way–as the conditional distribution of y given x.
(Remember we did this before, fix father 72 inches, but the sample size is too small)
We write the notation like this for the conditional distribution and the
conditional expectation. If we think the height data is
well-approximated by the bivariate normal distribution, then we should
see the normal approximation hold for each grouping.
Here, we stratify the son height by the standardized father heights
and see that the assumption appears to hold. Here’s the code that gives
us the desired plot. Now, we come back to defining correlation. Galton
showed– using mathematical statistics– that when two variables
follow a bivariate normal distribution, then for any given x the
expected value of the y in pairs for which x is set at that value is mu
y plus rho x minus mu x divided by sigma x times sigma y. Note
that this is a line with slope rho times sigma y divided by sigma x and
intercept mu y minus n times mu x. And therefore, this is the same as
the regression line we saw in a previous video.
(Now you must understand the slope difference between standalized vairable and non-standarilized vaiable in father son case we did in prevrious chapter)
That can be written like this. So in summary, if our data is
approximately bivariate, then the conditional expectation–which is the
best prediction for y given that we know the value of x–is given by the
regression line.
[][Textbook link]
This video corresponds to the textbook section on the bivariate normal distribution (advanced). https://rafalab.github.io/dsbook/regression.html#bivariate-normal-distribution-advanced
[][Key points]
When a pair of random variables are approximated by the bivariate normal distribution, scatterplots look like ovals. They can be thin (high correlation) or circle-shaped (no correlation).
When two variables follow a bivariate normal distribution, computing the regression line is equivalent to computing conditional expectations.
We can obtain a much more stable estimate of the conditional expectation by finding the regression line and using it to make predictions.
Code
galton_heights %>%
mutate(z_father = round((father - mean(father)) / sd(father))) %>%
filter(z_father %in% -2:2) %>%
ggplot() +
stat_qq(aes(sample = son)) +
facet_wrap( ~ z_father)
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# What are we doing here?
# https://towardsdatascience.com/q-q-plots-explained-5aa8495426c0?gi=409ca0cd036
image here
image here
image here
image here
images here
images here
image here
image here
image here
galton_heights %>%
#mutate(z_father=round((father-mean(father))/sd(father))) %>% # we can scale it manually or use scale() function
mutate(z_father = round(scale(father))) %>%
filter(z_father %in% -2:2) %>%
ggplot2::ggplot() +
stat_qq(aes(sample=son)) +
facet_wrap(~z_father)
# What does a QQ plot show?
# The purpose of the quantile-quantile (QQ) plot is to show if two data sets come from the same distribution. Plotting the first data set's quantiles along the x-axis and plotting the second data set's quantiles along the y-axis is how the plot is constructed.
# https://math.illinois.edu/system/files/inline-files/Proj9AY1516-report2.pdf
# Where does above formula comes from???
image here
image here
image here
image here
image here
So this video is recorded long times ago, how interesting)The equation shown at 0:10 is for the standard deviation of the conditional distribution, not the variance. **The variance is the standard deviation squared**. See the notes below the video for more clarification.
(The theory we’ve been
describing also tells us that the standard deviation of the conditional
distribution that we described in a previous video is Var
(
Correction: The equation shown at 0:10 is for the standard deviation of the conditional distribution, not the variance)
of Y given X equals sigma y times the square root of 1 minus rho
squared. This is where statements like x explains such and such percent
of the variation in y comes from. Note that the variance of y is sigma
squared. That’s where we start. If we condition on x, then the
variance goes down to 1 minus rho squared times sigma squared
y. So from there, we can compute how much the variance has gone
down. It has gone down by rho squared times 100%. So the correlation and
the amount of variance explained are related to each other. But it is
important to remember that the variance explained statement only makes
sense when the data is by a bivariate normal distribution.
[][Read all the course material and thinking, then trying to answer your questions, its a good way of learning, and thinking]
[][Textbook link]
This video corresponds to the textbook section on variance explained. # https://rafalab.github.io/dsbook/regression.html#variance-explained
[][Key points]
Conditioning on a random variable X can help to reduce variance of response variable Y.
The standard deviation of the conditional distribution is
SD(Y|X=x) = Sigma_y * SquaredRoot(1 - rho**2),
which is smaller than the standard deviation without conditioning sigma_y.
Because variance is the standard deviation squared, the variance of the conditional distribution is:
Var(Y\X=x) = Sigma_y^(2) * (1 - rho**2).
In the statement "X explains such and such percent of the variability," the percent value refers to the variance. The variance decreases by \(\rho^2\) percent.
The “variance explained” statement only makes sense when the data is approximated by a bivariate normal distribution.
# Where does above equation comes from ???
image here
image here
image here
image here
image here
We computed a regression line to predict the son’s height from the
father’s height. We used these calculations–here’s the code–to get the
slope and the intercept. This gives us the function that the conditional
expectation of y given x is 35.7 plus 0.5 times x. So, what if we wanted
to predict the father’s height based on the son’s? It is important to
know that this is not determined by computing the inverse function of
what we just saw, which would be this equation here. [][We need to
compute the expected value of x given y]. This gives us another
regression function altogether, with slope and intercept computed like
this. (How did this comes from??) So now we get that the
expected value of x given y, or the expected value of the father’s
height given the son’s height, is equal to 34 plus 0.5 y, a different
regression line.
So in summary, it’s important to remember that the regression line comes from computing expectations, and these give you two different lines, depending on if you compute the expectation of y given x or x given y.
[][Textbook link]
The link to the corresponding section of the textbook is warning: there are two regression lines. https://rafalab.github.io/dsbook/regression.html#warning-there-are-two-regression-lines
[][Key point] There are two different regression lines depending on whether we are taking the expectation of Y given X or taking the expectation of X given Y.
Code
# compute a regression line to predict the son's height from the father's height
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m_1 <- r * s_y / s_x
b_1 <- mu_y - m_1*mu_x
# compute a regression line to predict the father's height from the son's height
m_2 <- r * s_x / s_y
b_2 <- mu_x - m_2*mu_y
m_1
## [1] 0.5027904
b_1
## [1] 35.71249
m_2
## [1] 0.4986676
b_2
## [1] 33.96539
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
x_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m <- r * s_y/s_x # Thus the variance should be changed to variance**2, why made mistakes, poor Harvard
b <- mu_y - m*mu_x
m
## [1] 0.5027904
b
## [1] 35.71249
Song predict father wrong
compute expected value of x given y
m <- r * s_x/s_y
b <- mu_x - m*mu_y
m
## [1] 0.4986676
b
## [1] 33.96539
image here
two different lines depending on what you do
Look at the figure below. Scatter plot of son and father heights with
son heights on the y-axis and father heights on the x-axis. There is
also a regression line that runs from roughly (63,66) to (78,76). The
dots on the plot are scattered around the line. The slope of the
regression line in this figure is equal to what, in words? Slope
= (correlation coefficient of son and father heights) * (standard
deviation of sons’ heights / standard deviation of fathers’
heights) Slope = (correlation coefficient of son and father
heights) * (standard deviation of fathers’ heights / standard deviation
of sons’ heights) Slope = (correlation coefficient of son and father
heights) / (standard deviation of sons’ heights * standard deviation of
fathers’ heights) Slope = (mean height of fathers) - (correlation
coefficient of son and father heights * mean height of sons).
1 point possible (graded) Why does the regression line simplify to a line with intercept zero and slope rho when we standardize our x and y variables?
Try the simplification on your own first! When we standardize variables, both x and y will have a mean of one and a standard deviation of zero. When you substitute this into the formula for the regression line, the terms cancel out until we have the following equation: y_i = rho * x_i. When we standardize variables, both x and y will have a mean of zero and a standard deviation of one. When you substitute this into the formula for the regression line, the terms cancel out until we have the following equation: y_i = rho * x_i. When we standardize variables, both x and y will have a mean of zero and a standard deviation of one. When you substitute this into the formula for the regression line, the terms cancel out until we have the following equation: y_i = rho + x_i.
1 point possible (graded) What is a limitation of calculating conditional means?
Select ALL that apply. Each stratum we condition on (e.g., a specific father’s height) may not have many data points. Because there are limited data points for each stratum, our average values have large standard errors. Conditional means are less stable than a regression line. Conditional means are a useful theoretical tool but cannot be calculated.
1/1 point (graded) A regression line is the best prediction of Y given we know the value of X when: X and Y follow a bivariate normal distribution. Both X and Y are normally distributed. Both X and Y have been standardized. There are at least 25 X-Y pairs.
0/1 point (graded) Which one of the following scatterplots depicts an x and y distribution that is NOT well-approximated by the bivariate normal distribution?
I chose 3, but false, why.
The v-shaped distribution of points from the first plot means that the x and y variables do not follow a bivariate normal distribution.
****When a pair of random variables is approximated by a bivariate normal, the scatter plot looks like an oval**** (as in the 2nd, 3rd, and 4th plots) - [][it is okay if the oval is very round (as in the 3rd plot) or long and thin (as in the 4th plot)].
0/1 point (graded)
We previously calculated that the correlation coefficient between
fathers’ and sons’ heights is 0.5. Given this, what percent of the
variation in sons’ heights is explained by fathers’ heights? 0% 25%
50% 75% incorrect I choose 50% which is incorrect,
Think, Think, Think Answer Incorrect: Try again. [][When two variables
follow a bivariate normal distribution, the variation explained can be
calculated as rho^2 x
100](How does this comes from???).
1/1 point (graded)
Suppose the correlation between father and son’s height is 0.5, the standard deviation of fathers’ heights is 2 inches, and the standard deviation of sons’ heights is 3 inches. Given a one inch increase in a father’s height, what is the predicted change in the son’s height? 0.333 0.5 0.667 0.75 1 1.5 correct Answer Correct: Correct! TThe slope of the regression line is calculated by multiplying the correlation coefficient by the ratio of the standard deviation of son heights and standard deviation of father heights: var_son/var_father. (Note: here he means SD_son/SD_father, its a mistake) .
Comprehension Check due May 29, 2022 00:29 AWST
In the second part of this assessment, you’ll analyze a set of mother and daughter heights, also from GaltonFamilies.
Define female_heights, a set of mother and daughter heights sampled from GaltonFamilies, as follows:
set.seed(1989) #if you are using R 3.5 or earlier set.seed(1989, sample.kind=“Rounding”) #if you are using R 3.6 or later library(HistData) data(“GaltonFamilies”)
female_heights <- GaltonFamilies%>%
filter(gender == “female”) %>%
group_by(family) %>%
sample_n(1) %>%
ungroup() %>%
select(mother, childHeight) %>%
rename(daughter = childHeight)
set.seed(1989, sample.kind="Rounding") #if you are using R 3.6 or later
## Warning in set.seed(1989, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
library(HistData)
data("GaltonFamilies")
female_heights <- GaltonFamilies %>%
filter(gender == "female") %>%
group_by(family) %>%
sample_n(1) %>%
ungroup() %>%
select(mother, childHeight) %>%
rename(daughter = childHeight)
mean(female_heights$mother)
## [1] 64.125
sd(female_heights$mother)
## [1] 2.289292
mean(female_heights$daughter)
## [1] 64.28011
sd(female_heights$daughter)
## [1] 2.39416
cor(female_heights$mother, female_heights$daughter)
## [1] 0.3245199
mu_x <- mean(female_heights$mother)
mu_y <- mean(female_heights$daughter)
s_x <- sd(female_heights$mother)
s_y <- sd(female_heights$daughter)
r <- cor(female_heights$mother, female_heights$daughter)
m <- r * s_y/s_x
b <- mu_y - m*mu_x
m
## [1] 0.3393856
b
## [1] 42.51701
m*m
## [1] 0.1151826
outcome <- m * 60 + b
outcome
## [1] 62.88015
***Recall what we did in above courses***
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
x_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m <- r * s_y/s_x # Thus the variance should be changed to variance**2, why made mistakes, poor Harvard
b <- mu_y - m*mu_x
5/5 points (graded)
Calculate the mean and standard deviation of mothers’ heights, the mean and standard deviation of daughters’ heights, and the correlaton coefficient between mother and daughter heights. Mean of mothers’ heights correct 64.125
Loading Standard deviation of mothers’ heights correct 2.289292
Loading Mean of daughters’ heights correct 64.28011
Loading Standard deviation of daughters’ heights correct 2.39416
Loading Correlation coefficient correct 0.3245199
Loading You have used 1 of 10 attempts Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button. Correct (5/5 points)
3/3 points (graded)
Calculate the slope and intercept of the regression line predicting daughters’ heights given mothers’ heights. Given an increase in mother’s height by 1 inch, how many inches is the daughter’s height expected to change? Slope of regression line predicting daughters’ height from mothers’ heights correct 0.3393856
Loading Intercept of regression line predicting daughters’ height from mothers’ heights correct 42.51701
Loading Change in daughter’s height in inches given a 1 inch increase in the mother’s height correct 0.3393856
Loading You have used 1 of 10 attempts Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button. Correct (3/3 points)
1/1 point (graded) What percent of the variability in daughter heights is explained by the mother’s height?
Report your answer as a value between 0 and 100. Do NOT include the percent symbol (%) in your submission. correct 11
Loading You have used 4 of 10 attempts Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button. Correct (1/1 point)
1/1 point (graded)
A mother has a height of 60 inches. Using the regression formula, what is the conditional expected value of her daughter’s height given the mother’s height? correct 62.88015
Loading You have used 1 of 10 attempts Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button. Correct (1/1 point)
Ask your questions or make your comments about Stratification and Variance Explained here! Remember, one of the best ways to reinforce your own learning is by explaining something to someone else, so we encourage you to answer each other’s questions (without giving away the answers, of course).
Some reminders:
Search the discussion board before posting to see if someone else has asked the same thing before asking a new question
Please be specific in the title and body of your post regarding which question you're asking about to facilitate answering your question.
Posting snippets of code is okay, but posting full code solutions is not.
If you do post snippets of code, please format it as code for readability. If you're not sure how to do this, there are instructions in a pinned post in the "general" discussion forum.
In the Linear Models section, you will learn how to do linear regression.
After completing this section, you will be able to:
[][* Use multivariate regression to adjust for confounders.] Write linear models to describe the relationship between two or more variables. [][* Calculate the least squares estimates for a regression model using the lm function.] Understand the differences between tibbles and data frames. Use the do() function to bridge R functions and the tidyverse. Use the tidy(), glance(), and augment() functions from the broom package. Apply linear regression to measurement error models.
This section has four parts: Introduction to Linear Models, Least Squares Estimates, Tibbles, do, and broom, and Regression and Baseball. There are comprehension checks at the end of each part, along with an assessment on linear models at the end of the whole section for Verified learners only.
We encourage you to use R to interactively test out your answers and further your own learning. If you get stuck, we encourage you to search the discussion boards for the answer to your issue or ask us for help!
In a previous video, we found that the slope of the regression line
for predicting runs from bases on balls was 0.735. So, does this
mean that if we go and hire low salary players with many bases on balls
that increases the number of walks per game by 2 for our team? Our team
will score 1.47 more runs per game? [][We are again reminded
that association is not causation]. The data does provide strong
evidence that a team with 2 more bases on balls per game than the
average team scores 1.47 more runs per game, but this does not mean that
bases on balls are the cause. If we do compute the regression line slope
for singles, we get 0.449, a lower value. Note that a single gets you to
first base just like a base on balls. Those that know a little bit more
about baseball will tell you that with a single, runners that are on
base have a better chance of scoring than with a base on balls
(DId you see the logic conflict here).
So, how can base on balls be more predictive of runs? The reason this happens is because of [][confounding]. Note the correlation between homeruns, bases on balls, and singles. We see that the correlation between bases on balls and homeruns is quite high compared to the other two pairs. [][It turns out that pitchers, afraid of homeruns, will sometimes avoid throwing strikes to homerun hitters]. As a result, homerun hitters tend to have more bases on balls. Thus, a team with many homeruns will also have more bases on balls than average, and as a result, it may appear that bases on balls cause runs. But it is actually the homeruns that caused the runs.
In this case, we say that bases on balls are confounded with homeruns. [][But could it be that bases on balls still help? To find out, we somehow have to adjust for the homerun effect. Regression can help with this]. =============================================================================================================================
library(Lahman)
library(tidyverse)
#library(dslabs)
Teams %>%
filter(yearID %in% 1961:2001) %>%
mutate(Singles = (H - HR - X2B - X3B)/G, BB = BB/G, HR = HR/G) %>%
summarize(cor(BB, HR), cor(Singles, HR), cor(BB, Singles))
## cor(BB, HR) cor(Singles, HR) cor(BB, Singles)
## 1 0.4039313 -0.1737435 -0.05603822
[][Did you see the logic conflict here ??? Single-R slope
< BB-R slope, but Single gives runner better chance. Now
interesting]++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
To try to determine if bases on balls is still useful for creating runs, a first approach is to keep home runs fixed at a certain value and then examine the relationship between runs and bases on balls. As we did when we stratified fathers by rounding to the closest inch, here, we can stratify home runs per game to the closest 10th. [][We filtered our strata with few points. We use this code to generate an informative data set. And then, we can make a scatter plot for each strata]. A scatterplot of runs versus bases on balls. This is what it looks like. Remember that the regression slope for predicting runs with bases on balls when we ignore home runs was 0.735.
[][But once we stratify by home runs, these slopes are substantially reduced]. We can actually see what the slopes are by using this code. We stratify by home run and then compute the slope using the formula that we showed you previously. These values are closer to the slope we obtained from singles, which is 0.449. Which is more consistent with our intuition. Since both singles and bases on ball get us to first base, they should have about the same predictive power.
Now, although our understanding of the application– our understanding of baseball–tells us that home runs cause bases on balls and not the other way around, we can still check if, after stratifying by base on balls, we still see a home run effect or if it goes down. We use the same code that we just used for bases on balls. But now, we swap home runs for bases on balls to get this plot. In this case, the slopes are the following. You can see they are all around 1.5, 1.6, 1.7. So they do not change that much from the original slope estimate, which was 1.84.
[][Regardless, it seems that if we stratify by home runs, we have an approximately bivariate normal distribution for runs versus bases on balls. Similarly, if we stratify by bases on balls, we have an approximately normal bivariate distribution for runs versus home runs. So what do we do?] It is somewhat complex to be computing regression lines for each strata. We’re essentially fitting this model that you can see in this equation with the slopes for x1 changing for different values of x2 and vice versa. Here, x1 is bases on balls. And x2 are home runs. Is there an easier approach? (So we have to take whole influence aerospace into a equation???)
[][Note that if we take random variability into account, the
estimated slopes by strata don’t appear to change that much]
(Sorry what ???). If these slopes are in fact the
same, this implies that this function beta 1 of x2 and the other
function beta 2 of x1 are actually constant. Which, in turn,
implies that the expectation of runs condition on home runs and bases on
balls can be written in this simpler model. This model implies that if
the number of home runs is fixed, we observe a linear relationship
between runs and bases on balls. And that the slope of that relationship
does not depend on the number of home runs. Only the slope changes as
the home runs increase. (Someone asked a question here)
[][The statement “Only the slope changes as the home runs increase”
is actually incorrect. We have since modified the script and re-filmed
the video, though editing for the updated video has not been completed
yet. The correct, updated text will be “Only the INTERCEPT changes”. I
will add a comment on the video page now until the updated video is
uploaded.] Thingking and asking and answering.
[][The same is true if we swap home runs and bases on balls. In this analysis, referred to as multivariate regression, we say that the bases on balls slope beta 1 is adjusted for the home run effect. If this model is correct, then confounding has been accounted for. But how do we estimate beta 1 and beta 2 from the data? For this, we’ll learn about linear models and least squares estimates.]
[][Textbook link]
This video corresponds to the textbook section on multivariate regression. https://rafalab.github.io/dsbook/linear-models.html#multivariate-regression
[][Key points]
A first approach to check confounding is to keep HRs fixed at a certain value and then examine the relationship between BB and runs.
The slopes of BB after stratifying on HR are reduced, but they are not 0, which indicates that BB are helpful for producing runs, just not as much as previously thought.
NOTE: There is an error in the script. The quote "Only the slope changes as the home runs increase." will be corrected to "Only the INTERCEPT changes as the home runs increase." in a future version of the video.
Code
# stratify HR per game to nearest 10, filter out strata with few points
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR_strata = round(HR/G, 1),
BB_per_game = BB / G,
R_per_game = R / G) %>%
filter(HR_strata >= 0.4 & HR_strata <=1.2)
# scatterplot for each HR stratum
dat %>%
ggplot(aes(BB_per_game, R_per_game)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
facet_wrap( ~ HR_strata)
## `geom_smooth()` using formula 'y ~ x'
# calculate slope of regression line after stratifying by HR
dat %>%
group_by(HR_strata) %>%
summarize(slope = cor(BB_per_game, R_per_game)*sd(R_per_game)/sd(BB_per_game))
## # A tibble: 9 × 2
## HR_strata slope
## <dbl> <dbl>
## 1 0.4 0.734
## 2 0.5 0.566
## 3 0.6 0.412
## 4 0.7 0.285
## 5 0.8 0.365
## 6 0.9 0.261
## 7 1 0.512
## 8 1.1 0.454
## 9 1.2 0.440
# stratify by BB
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(BB_strata = round(BB/G, 1),
HR_per_game = HR / G,
R_per_game = R / G) %>%
filter(BB_strata >= 2.8 & BB_strata <=3.9)
# scatterplot for each BB stratum
dat %>% ggplot(aes(HR_per_game, R_per_game)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
facet_wrap( ~ BB_strata)
## `geom_smooth()` using formula 'y ~ x'
# slope of regression line after stratifying by BB
dat %>%
group_by(BB_strata) %>%
summarize(slope = cor(HR_per_game, R_per_game)*sd(R_per_game)/sd(HR_per_game))
## # A tibble: 12 × 2
## BB_strata slope
## <dbl> <dbl>
## 1 2.8 1.52
## 2 2.9 1.57
## 3 3 1.52
## 4 3.1 1.49
## 5 3.2 1.58
## 6 3.3 1.56
## 7 3.4 1.48
## 8 3.5 1.63
## 9 3.6 1.83
## 10 3.7 1.45
## 11 3.8 1.70
## 12 3.9 1.30
dat <- Teams %>%
filter(yearID %in% 1961:2001) %>%
mutate(HR_strata = round(HR/G, 1),
BB_per_game = BB/G,
R_per_game = R/G) %>%
filter(HR_strata >= 0.4 & HR_strata <= 1.2)
head(dat)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1961 AL BAL BAL <NA> 3 163 82 95 67 <NA> <NA> N
## 2 1961 AL BOS BOS <NA> 6 163 82 76 86 <NA> <NA> N
## 3 1961 AL CHA CHW <NA> 4 163 81 86 76 <NA> <NA> N
## 4 1961 NL CHN CHC <NA> 7 156 78 64 90 <NA> <NA> N
## 5 1961 NL CIN CIN <NA> 1 154 77 93 61 <NA> <NA> Y
## 6 1961 AL CLE CLE <NA> 5 161 81 78 83 <NA> <NA> N
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 N 691 5481 1393 227 36 149 581 902 39 30 NA NA 588 526 3.22 54 21 33
## 2 N 729 5508 1401 251 37 112 647 847 56 36 NA NA 792 687 4.29 35 6 30
## 3 N 765 5556 1475 216 46 138 550 612 100 40 NA NA 726 653 4.06 39 3 33
## 4 N 689 5344 1364 238 51 176 539 1027 35 25 NA NA 800 689 4.48 34 6 25
## 5 N 710 5243 1414 247 35 158 423 761 70 33 NA NA 653 575 3.78 46 12 40
## 6 N 737 5609 1493 257 39 150 492 720 34 11 NA NA 752 665 4.15 35 12 23
## IPouts HA HRA BBA SOA E DP FP name park
## 1 4413 1226 109 617 926 126 173 0.980 Baltimore Orioles Memorial Stadium
## 2 4326 1472 167 679 831 143 140 0.977 Boston Red Sox Fenway Park II
## 3 4344 1491 158 498 814 128 138 0.980 Chicago White Sox Comiskey Park
## 4 4155 1492 165 465 755 183 175 0.970 Chicago Cubs Wrigley Field
## 5 4110 1300 147 500 829 134 124 0.977 Cincinnati Reds Crosley Field
## 6 4329 1426 178 599 801 139 142 0.977 Cleveland Indians Cleveland Stadium
## attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro HR_strata BB_per_game
## 1 951089 96 96 BAL BAL BAL 0.9 3.564417
## 2 850589 102 103 BOS BOS BOS 0.7 3.969325
## 3 1146019 99 97 CHW CHA CHA 0.8 3.374233
## 4 673057 101 104 CHC CHN CHN 1.1 3.455128
## 5 1117603 102 101 CIN CIN CIN 1.0 2.746753
## 6 725547 97 98 CLE CLE CLE 0.9 3.055901
## R_per_game
## 1 4.239264
## 2 4.472393
## 3 4.693252
## 4 4.416667
## 5 4.610390
## 6 4.577640
dat %>%
ggplot2::ggplot(aes(BB_per_game, R_per_game)) +
geom_point(alpha=0.3) +
geom_smooth(formula = y ~ x, method = "lm") +
facet_wrap(~HR_strata)
dat %>%
group_by(HR_strata) %>%
summarise(slope = cor(BB_per_game, R_per_game) * sd(R_per_game) / sd(BB_per_game))
## # A tibble: 9 × 2
## HR_strata slope
## <dbl> <dbl>
## 1 0.4 0.734
## 2 0.5 0.566
## 3 0.6 0.412
## 4 0.7 0.285
## 5 0.8 0.365
## 6 0.9 0.261
## 7 1 0.512
## 8 1.1 0.454
## 9 1.2 0.440
dat <- Teams %>%
filter(yearID %in% 1961:2001) %>%
mutate(BB_strata = round(BB/G, 1),
HR_per_game = HR/G,
R_per_game = R/G) %>%
filter(BB_strata >= 2.8 & BB_strata <= 3.9)
dat %>%
ggplot(aes(HR_per_game, R_per_game)) +
geom_point(alpha=0.3) +
geom_smooth(formula = y~x, method = "lm") +
facet_wrap(~BB_strata)
dat %>%
group_by(BB_strata) %>%
summarise(slope = cor(HR_per_game, R_per_game) * sd(R_per_game) / sd(HR_per_game))
## # A tibble: 12 × 2
## BB_strata slope
## <dbl> <dbl>
## 1 2.8 1.52
## 2 2.9 1.57
## 3 3 1.52
## 4 3.1 1.49
## 5 3.2 1.58
## 6 3.3 1.56
## 7 3.4 1.48
## 8 3.5 1.63
## 9 3.6 1.83
## 10 3.7 1.45
## 11 3.8 1.70
## 12 3.9 1.30
Since Galton’s original development, regression has become
one of the most widely used tools in data science. One reason
for this has to do with the fact that [][regression permits us to
find relationships between two variables while adjusting for
others], as we have just shown for bases on balls and home runs.
This has been particularly popular in fields where randomized
experiments are hard to run
(Think, other cases ? can we apply it into analyzing app users ???),
such as economics and epidemiology. When we’re not able to
randomly assign each individual to a treatment or control group,
confounding is particularly prevalent.
For example, consider estimating the effect of any fast foods on life
expectancy using data collected from a random sample of people in some
jurisdiction. Fast food consumers are more likely to be smokers,
drinkers, and have lower incomes
(Your customer groups are already limited into certain group).
Therefore, a naive regression model may lead to an overestimate of a
negative health effect of fast foods. So how do we adjust for
confounding in practice? We can use regression. [][We have described
how, if data is bivariate normal, then the conditional expectation
follow a regression line], that the conditional expectation as a
line is not an extra assumption, but rather a result derived from the
assumption, that they are approximately bivariate normal.
[][However, in practice it is common to explicitly write down a model that describes the relationship between two or more variables using what is called a linear model]. We know that linear here does not refer to lines exclusively, but rather to the fact that the conditional expectation is a linear combination of known quantities. Any combination that multiplies them by a constant and then adds them up with, perhaps, a shift. For example, 2 plus 3x minus 4y plus 5z is a linear combination of x, y, and z. So beta 0 plus beta 1x1, plus beta 2x 2 is a linear combination of x1 and x2. The simplest linear model is a constant beta 0. The second simplest is a line, beta 0 plus beta 1x.
For Galton’s data, we would denote n observed fathers’ heights with x1 through xn. Then we model n son heights we are trying to predict with the following model. Here, the little xi’s are the father’s heights, which are fixed not random, due to the conditioning. We’ve conditioned on these values. And then Yi big Yi is the random son’s height that we want to predict. We further assume that the errors that are denoted with the Greek letter for E, epsilon, epsilon i, are independent from each other, have expected value 0, and the standard deviation, which is usually called sigma, does not depend on i. It’s the same for every individual. We know the xi, but [][to have a useful model for prediction, we need beta 0 and beta 1. We estimate these from the data]. Once we do, we can predict the sons’ heights from any father’s height, x. Note that if we further assume that the epsilons are normally distributed, then this model is exactly the same one we derived earlier for the bivariate normal distribution. A somewhat nuanced difference is that in the first approach, we assumed the data was a bivariate normal, and [][the linear model was derived, not assumed].
In practice, linear models are just assumed without necessarily assuming normality. The distribution of the epsilons is not specified. But nevertheless, if your data is bivariate normal, the linear model that we just showed holds. If your data is not bivariate normal, then you will need to have other ways of justifying the model ([][Here, what is the other way to justifying the models ???]). One reason linear models are popular is that they are interpretable. In the case of Galton’s data, we can interpret the data like this. Due to inherited genes, the son’s height prediction grows by beta 1 for each inch we increase the father’s height x. Because not all sons with fathers of height x are of equal height, we need the term epsilon, which explains the remaining variability. This remaining variability includes the mother’s genetic effect, environmental factors, and other biological randomness.
Note that given how we wrote the model, the intercept beta 0 is not very interpretable, as it is the predicted height of a son with a father with no height. Due to regression to the mean, the prediction will usually be a bit larger than 0, which is really not very interpretable. To make the intercept parameter more interpretable, we can rewrite the model slightly in the following way. Here, we have changed xi to xi minus the average height x bar. We have centered our covariate xi. In this case, beta 0, the intercept, would be the predicted height for the average father for the case where xi equals x bar.
quotation form the book: Note that if we further assume that the ε is normally distributed, then this model is exactly the same one we derived earlier by assuming bivariate normal data. A somewhat nuanced difference is that in the first approach we assumed the data was bivariate normal and that the linear model was derived, not assumed. In practice, linear models are just assumed without necessarily assuming normality: the distribution of the εs is not specified. Nevertheless, if your data is bivariate normal, the above linear model holds. If your data is not bivariate normal, then you will need to have other ways of justifying the model.
[][Textbook link]
This video corresponds to the textbook section on linear models. https://rafalab.github.io/dsbook/linear-models.html
[][Key points]
“Linear” here does not refer to lines, but rather to the fact that the conditional expectation is a linear combination of known quantities.
[][* In Galton’s model, we assume Y (son’s height) is a linear combination of a constant and X (father’s height) plus random noise. We further assume that Epsilon_i are independent from each other, have expected value 0 and the standard deviation Sigma which does not depend on i.*] Note that if we further assume that Epsilon is normally distributed, then the model is exactly the same one we derived earlier by assuming bivariate normal data. We can subtract the mean from X to make more Beta_0 interpretable.
linear combination of x, y and z
a horizontal line
second simplest linear mode have a slopel
random son heights we want to predict with conditioned fathers heights.png
In father-son heights dataset
So smart
1/1 point (graded) As described in the videos, when we stratified our regression lines for runs per game vs. bases on balls by the number of home runs, what happened? The slope of runs per game vs. bases on balls within each stratum was reduced because we removed confounding by home runs. The slope of runs per game vs. bases on balls within each stratum was reduced because there were fewer data points. The slope of runs per game vs. bases on balls within each stratum increased after we removed confounding by home runs. The slope of runs per game vs. bases on balls within each stratum stayed about the same as the original slope. correct Answer Correct: Correct.
1/1 point (graded)
We run a linear model for sons’ heights vs. fathers’ heights using the Galton height data, and get the following results:
> lm(son ~ father, data = galton_heights)
Call: lm(formula = son ~ father, data = galton_heights)
Coefficients: (Intercept) father
35.71 0.50
Interpret the numeric coefficient for “father.” For every inch we increase the son’s height, the predicted father’s height increases by 0.5 inches. For every inch we increase the father’s height, the predicted son’s height grows by 0.5 inches. For every inch we increase the father’s height, the predicted son’s height is 0.5 times greater. correct
Explanation
The coefficient for “father” gives the predicted increase in son’s height for each increase of 1 unit in the father’s height. In this case, it means that for every inch we increase the father’s height, the son’s predicted height increases by 0.5 inches.
1/1 point (graded)
We want the intercept term for our model to be more interpretable, so we run the same model as before but now we subtract the mean of fathers’ heights from each individual father’s height to create a new variable centered at zero.
galton_heights <- galton_heights %>%
mutate(father_centered=father - mean(father))
We run a linear model using this centered fathers’ height variable.
> lm(son ~ father_centered, data = galton_heights)
Call: lm(formula = son ~ father_centered, data = galton_heights)
Coefficients: (Intercept) father_centered
70.45 0.50
Interpret the numeric coefficient for the intercept. The height of a son of a father of average height is 70.45 inches. The height of a son when a father’s height is zero is 70.45 inches. The height of an average father is 70.45 inches. correct
Explanation
Because the fathers’ heights (the independent variable) have been centered on their mean, the intercept represents the height of the son of a father of average height. In this case, that means that the height of a son of a father of average height is 70.45 inches.
If we had not centered fathers’ heights to its mean, then the intercept would represent the height of a son when a father’s height is zero.
1/1 point (graded)
Suppose we fit a multivariable regression model for expected runs based on BB and HR:
E[R|BB = x_1, HR = x_2] = Beta_0 + Beta_1 * x_1 + Beta_2 * x_2
Suppose we fix . Then we observe a linear relationship between runs and HR with intercept of:
Beta_0 Beta_0 + Beta_2 * x_2 Beta_0 + Beta_1 * x_1 Beta_0 + Beta_2 * x_1
correct
Explanation
If is fixed BB=x_1, then is fixed and acts as the intercept for this regression model. This is the basis of stratificaton.
0.67/1 point (graded) Which of the following are assumptions for the errors Epsilon in a linear regression model?
Check ALL correct answers. The Epsilon are independent of each other correct The Epsilon have expected value 0 correct The variance of Epsilon is a constant correct partially correct
lm(formula = son ~ father, data = galton_heights)
##
## Call:
## lm(formula = son ~ father, data = galton_heights)
##
## Coefficients:
## (Intercept) father
## 35.7125 0.5028
Summary: R linear regression uses the lm() function to create a regression model given some formula, in the form of Y~X+X2. To look at the model, you use the summary() function.
galton_heights <- galton_heights %>%
mutate(father_centered = father - mean(father))
lm(son ~ father_centered, data = galton_heights)
##
## Call:
## lm(formula = son ~ father_centered, data = galton_heights)
##
## Coefficients:
## (Intercept) father_centered
## 70.4547 0.5028
# So intercept changed, a lot, does that mean the median father and mean son are not match?
# Or someone is not normal distributed ???
Ask your questions or make your comments about Introduction to Linear Models here! Remember, one of the best ways to reinforce your own learning is by explaining something to someone else, so we encourage you to answer each other’s questions (without giving away the answers, of course).
Some reminders:
Search the discussion board before posting to see if someone else has asked the same thing before asking a new question
Please be specific in the title and body of your post regarding which question you're asking about to facilitate answering your question.
Posting snippets of code is okay, but posting full code solutions is not.
If you do post snippets of code, please format it as code for readability. If you're not sure how to do this, there are instructions in a pinned post in the "general" discussion forum.
For linear models to be useful, we have to estimate the unknown parameters, the betas. The standard approach in science is to find the values that minimize the distance of the fitted model to the data. To quantify, this we use the [][least squares equation]. For Galton’s data, we would write something like this. This quantity is called the [][Residual Sum of Squares, RSS]. Once we find the values that minimize the RSS, we call the values the Least Squares Estimate, LSE, and denote them, in this case, with beta 0 hat and beta 1 hat.
[][Let’s write the function that computes the RSS for any pair of values, beta 0 and beta 1, for our heights data. It would look like this. So for any pair of values, we get an RSS. So this is a three-dimensional plot with beta 1 and beta 2, and x and y and the RSS as a z. To find the minimum, you would have to look at this three-dimensional plot]. Here, we’re just going to make a two-dimensional version by keeping beta 0 fixed at 25. So it will be a function of the RSS as a function of beta 1. We can use this code to produce this plot. We can see a clear minimum for beta 1 at around 0.65. So you could see how we would pick the least squares estimates. However, this minimum is for beta 1 when beta 0 is fixed at 25. But we don’t know if that’s the minimum for beta 0. We don’t know if 25 comma 0.65 minimizes the equation across all pairs.
[][****We could use trial and error, but it’s really not going to
work here (Why and How). Instead we will use calculus.****]
We’ll take the partial derivatives, set them
equal to 0, and solve for beta 1 and beta 0. Of course, if we have many
parameters, these equations can get rather complex. But there are
functions in R that do these calculations for us. We will learn these
soon. To learn the mathematics behind this, you can consult the book on
linear models.
quotation form the book: In mathematics, when we multiply each variable by a constant and then add them together, we say we formed a linear combination of the variables. For example, 3x−4y+5z is a linear combination of x, y, and z. We can also add a constant so 2+3x−4y+5z is also linear combination of x, y, and z.
[][Textbook link]
This video corresponds to the textbook section on least squares estimates. https://rafalab.github.io/dsbook/linear-models.html#lse
[][Key points]
For regression, we aim to find the coefficient values that minimize the distance of the fitted model to the data.
Residual sum of squares (RSS) measures the distance between the true value and the predicted value given by the regression line. The values that minimize the RSS are called the least squares estimates (LSE).
We can use partial derivatives to get the values for and in Galton's data.
NOTE: At timepoint 0:57 in the video, the Professor uses the terms and , but this should be and
Code
library(HistData)
library(tidyverse)
data("GaltonFamilies")
set.seed(1983)
galton_heights <- GaltonFamilies %>%
filter(gender == "male") %>%
group_by(family) %>%
sample_n(1) %>%
ungroup() %>%
select(father, childHeight) %>%
rename(son = childHeight)
#https://bookdown.org/yih_huynh/Guide-to-R-Book/groupby.html
rss <- function(beta0, beta1){
resid <- galton_heights$son - (beta0 + beta1 * galton_heights$father)
return(sum(resid^2))
}
# plot RSS as a function of beta1 when beta0=25
beta1 = seq(0, 1, len=nrow(galton_heights))
results <- data.frame(beta1 = beta1,
rss = sapply(beta1, rss, beta0 = 25))
# https://r-lang.com/r-sapply/
results %>%
ggplot(aes(beta1, rss)) +
geom_line()
rss <- function(beta0, beta1, data){
resid <- galton_heights$son - (beta0 + beta1 * galton_heights$father)
return(sum(resid^2))
}
beta1 = seq(0, 1, len=nrow(galton_heights))
results <- data.frame(beta1 = beta1,
rss = sapply(beta1, rss, beta0 = 25))
results %>%
ggplot(aes(beta1, rss)) +
#geom_line() +
geom_line(aes(beta1, rss), col=2)
[][Definition of trial and error: a finding out of the best way to reach a desired result or a correct solution by trying out one or more ways or means and by noting and eliminating errors or causes of failure also : the trying of one thing or another until something succeeds.]
[][In r, we can obtain the least squares estimates using the lm function]. To fit the following model where Yi is the son’s height and Xi is the father height, we would write the following piece of code. This gives us the least squares estimates, which we can see in the output of r. The general way we use lm is by using the tilde (~) character to let lm know which is the value we’re predicting that’s on the left side of the tilde, and which variables we’re using to predict–those will be on the right side of the tilde. The intercept is added automatically to the model. So you don’t have to include it when you write it.
The object fit that we just computed includes more information about the least squares fit. We can use the function summary to extract more of this information, like this. To understand some of the information included in this summary, we need to remember that the LSE are random variables. Mathematical statistics gives us some ideas of the distribution of these random variables. And we’ll learn some of that next. End of transcript. Skip to the start.
[][Textbook link]
This video corresponds to the textbook section on the lm function. https://rafalab.github.io/dsbook/linear-models.html#the-lm-function
[][Key points]
When calling the lm() function, the variable that we want to predict is put to the left of the ~ symbol, and the variables that we use to predict is put to the right of the ~ symbol. The intercept is added automatically.
LSEs are random variables.
Code
library(HistData)
data("GaltonFamilies")
galton_heights <- GaltonFamilies %>%
filter(childNum == 1 & gender == 'male') %>%
select(father, childHeight) %>%
rename(son=childHeight)
# fit regression line to predict son's height from father's height
fit <- lm(son ~ father, data = galton_heights)
fit
##
## Call:
## lm(formula = son ~ father, data = galton_heights)
##
## Coefficients:
## (Intercept) father
## 35.7125 0.5028
# summary statistics
summary(fit)
##
## Call:
## lm(formula = son ~ father, data = galton_heights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9022 -1.4050 0.0922 1.3422 8.0922
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.71249 4.51737 7.906 2.75e-13 ***
## father 0.50279 0.06533 7.696 9.47e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.22 on 177 degrees of freedom
## Multiple R-squared: 0.2507, Adjusted R-squared: 0.2465
## F-statistic: 59.23 on 1 and 177 DF, p-value: 9.473e-13
fit <- lm(son ~ father, data = galton_heights)
fit
##
## Call:
## lm(formula = son ~ father, data = galton_heights)
##
## Coefficients:
## (Intercept) father
## 35.7125 0.5028
summary(fit)
##
## Call:
## lm(formula = son ~ father, data = galton_heights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9022 -1.4050 0.0922 1.3422 8.0922
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.71249 4.51737 7.906 2.75e-13 ***
## father 0.50279 0.06533 7.696 9.47e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.22 on 177 degrees of freedom
## Multiple R-squared: 0.2507, Adjusted R-squared: 0.2465
## F-statistic: 59.23 on 1 and 177 DF, p-value: 9.473e-13
[][The LSE are derived from the data, Y1 through Yn, which are random. This implies that our estimates are random variables]. To see this, we can run a Monte Carlo simulation in which we assume that the son and father height data that we have defines an entire population. And we’re going to take random samples of size 50 and compute the regression slope coefficient for each one. We write this code, which gives us several estimates of the regression slope. We can see the variability of the estimates by plotting their distribution. Here you can see the histograms of the estimated beta 0’s and the estimated beta 1’s. [][The reason these look normal is because the central limit theorem applies here as well. For large enough N, the least squares estimates will be approximately normal with expected value beta 0 and beta 1 respectively].
The standard errors are a bit complicated to compute, but mathematical theory does allow us to compute them, and they are included in the summary provided by the lm function. Here are the estimated standard errors for one of our simulated data sets. You could see them at the second column in the coefficients table. [][You can see that the standard errors estimates reported by the summary function are close to the standard errors that we obtain from our Monte Carlo simulation].
The summary function also reports t-statistics–this is the t value column–and p-value. This is the Pr bigger than absolute value of t column. [][The t-statistic] is not actually based on the central limit theorem, but rather on the assumption that [][the epsilons follow a normal distribution]. Under this assumption, mathematical theory tells us that the LSE divided by their standard error, which we can see here and here, follow a t distribution with N minus p degrees of freedom, with p the number of parameters in our model, which in this case is 2.
The 2p values are testing the null hypothesis that beta 0 is 0 and beta 1 is 0 respectively.
[Go read this book][**https://www.statisticshowto.com/probability-and-statistics/null-hypothesis/]
[][Note that as we described previously, for large enough
N, the central limit works, and the t distribution becomes almost the
same as a normal distribution. So if either you assume the errors are
normal and use the t distribution or if you assume that N is large
enough to use the central limit theorem, you can construct confidence
intervals for your parameters.]
(How to construct confidnece intervals for our parameters)
We know here that although we will not show examples in this video, hypothesis testing for regression models is very commonly used in, for example, epidemiology and economics, to make statements such as the effect of A and B was statistically significant after adjusting for X, Y, and Z. But it’s very important to note that several assumptions–we just described some of them–have to hold for these statements to hold. [][the effect of A and B was statistically significant after adjusting for X, Y, and Z]
[][Textbook link]
This video corresponds to the textbook section on LSE. https://rafalab.github.io/dsbook/linear-models.html#lse-are-random-variables
[]Key points]
Because they are derived from the samples, LSE are random variables.
Beat_0 and Beta_1 appear to be normally distributed because the central limit theorem plays a role.
The t-statistic depends on the assumption that follows a normal distribution.
Code
# Monte Carlo simulation ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
B <- 1000
N <- 50
lse <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) %>%
lm(son ~ father, data = .) %>%
.$coef
})
lse <- data.frame(beta_0 = lse[1,], beta_1 = lse[2,])
# Plot the distribution of beta_0 and beta_1
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
p1 <- lse %>% ggplot(aes(beta_0)) + geom_histogram(bins = 15, color = "black")
p2 <- lse %>% ggplot(aes(beta_1)) + geom_histogram(binwidth = 0.05, color = "black")
grid.arrange(p1, p2, ncol = 2)
# summary statistics
sample_n(galton_heights, N, replace = TRUE) %>%
lm(son ~ father, data = .) %>%
summary %>%
.$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.9641244 7.6317391 4.188315 1.194960e-04
## father 0.5513639 0.1100775 5.008871 7.821676e-06
lse %>% summarize(se_0 = sd(beta_0), se_1 = sd(beta_1))
## se_0 se_1
## 1 8.99551 0.1298399
B <- 1000
N <- 50
lse <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) %>%
lm(son ~ father, data = .) %>%
.$coef
})
lse <- data.frame(beta_0 = lse[1, ], beta_1 = lse[2, ])
head(lse)
## beta_0 beta_1
## 1 37.81318 0.4748176
## 2 31.44613 0.5604301
## 3 42.86975 0.4009377
## 4 23.15062 0.6898149
## 5 34.90689 0.5175281
## 6 42.93412 0.4008286
p1 <- lse %>% ggplot2::ggplot(aes(beta_0)) + geom_histogram(binwidth = 5, color = "black")
p2 <- lse %>% ggplot2::ggplot(aes(beta_1)) + geom_histogram(binwidth = 0.1, color = "black")
grid.arrange(p1, p2, ncol = 2)
sample_n(galton_heights, N, replace = TRUE) %>%
lm(son ~ father, data = .) %>%
summary
##
## Call:
## lm(formula = son ~ father, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3600 -1.2751 0.2097 1.3369 5.0762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.9137 7.1890 4.717 2.10e-05 ***
## father 0.5213 0.1043 4.996 8.17e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.026 on 48 degrees of freedom
## Multiple R-squared: 0.3421, Adjusted R-squared: 0.3284
## F-statistic: 24.96 on 1 and 48 DF, p-value: 8.171e-06
lse %>% summarise(se_0 = sd(beta_0), se_1 = sd(beta_1))
## se_0 se_1
## 1 9.058707 0.1306047
you can see the standard error estimates reported by the summary function are close to the standard errors that we obtained from our monte carlo simulation.png
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
hypothesis testing for regression models is very commonly used in many areas as the effecta and b was statistically significant after adjusting for x y and z.png
Although interpretation is not straight-forward, it is also useful to know that the LSE can be strongly correlated, which can be seen using this code:
lse %>% summarize(cor(beta_0, beta_1))
However, the correlation depends on how the predictors are defined or transformed.
Here we standardize the father heights, which changes x_i to x_i - x hat
.
B <- 1000 N <- 50 lse <- replicate(B, { sample_n(galton_heights, N, replace = TRUE) %>% mutate(father = father - mean(father)) %>% lm(son ~ father, data = .) %>% .$coef })
Observe what happens to the correlation in this case:
cor(lse[1,], lse[2,])
lse %>% summarize(cor(beta_0, beta_1))
## cor(beta_0, beta_1)
## 1 -0.9994401
B <- 1000
N <- 50
lse <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) %>%
mutate(father = father - mean(father)) %>%
lm(son ~ father, data = .) %>%
.$coef
})
cor(lse[1,], lse[2,])
## [1] -0.1781111
Once we fit our model, we can obtain predictions of y by
plugging the estimates (the Beta0 and Beta1 and etc.) into
the regression model. For example, if the father’s height is x,
then our prediction for y–which we’ll denote with a hat on top of the
y–for the son’s height will be the following. We’re just plugging in
beta–the estimated betas into the equation. If we plot y hat versus x,
we’ll see the regression line.
[][Note that the prediction y hat is also a random variable, and mathematical theory tells us what the standard errors are]. If we assume the errors are normal or have a large enough sample size to use the Central Limit Theorem, we can construct confidence intervals for our predictions, as well. In fact, the ggplot layer geom underscore smooth, when we set method equals to lm–we’ve previously shown this for several plots–plots confidence intervals around the predicted y hat. Let’s look at an example with this code. You can see the regression line. Those are the predictions, and you see a band around them. Those are the confidence intervals.
[][The R function predict() takes an lm object as input and returns these predictions]. We can see it here in this code which produces this plot, and if requested the standard errors and other information from which we can construct confidence intervals can be obtained from the predict function. You can see it by running this code.
[][You must go read the book, read this is absolately not enough, and you forget] https://rafalab.github.io/dsbook/linear-models.html#predicted-values-are-random-variables ********************************************************************************************************
[][Textbook link]
This video corresponds to the textbook section on predicted values. https://rafalab.github.io/dsbook/linear-models.html#predicted-values-are-random-variables
[][Key points]
The predicted value is often denoted as Y_hat (image you put `^` on top of `Y`), which is a random variable. Mathematical theory tells us what the standard error of the predicted value is.
The predict() function in R can give us predictions directly.
Code
# plot predictions and confidence intervals
galton_heights %>%
ggplot(aes(father, son)) +
geom_point() +
geom_smooth(formula = y ~ x, method = "lm")
# predict Y directly
fit <- galton_heights %>%
lm(son ~ father, data = .)
Y_hat <- predict(fit, se.fit = TRUE) # https://stats.stackexchange.com/questions/86624/se-of-fit-versus-se-of-prediction
Y_hat
## $fit
## 1 2 3 4 5 6 7 8
## 75.18154 73.67317 73.42177 73.42177 73.42177 72.91898 72.91898 72.41619
## 9 10 11 12 13 14 15 16
## 72.41619 72.41619 72.41619 72.41619 72.26536 71.91340 71.91340 71.91340
## 17 18 19 20 21 22 23 24
## 71.91340 71.91340 71.91340 72.16480 71.91340 71.91340 71.41061 71.41061
## 25 26 27 28 29 30 31 32
## 71.41061 71.41061 71.41061 71.41061 71.76257 71.41061 71.66201 71.66201
## 33 34 35 36 37 38 39 40
## 71.41061 71.76257 71.41061 71.66201 71.41061 71.41061 71.41061 71.41061
## 41 42 43 44 45 46 47 48
## 71.41061 71.41061 71.41061 71.41061 71.41061 70.90782 70.90782 70.90782
## 49 50 51 52 53 54 55 56
## 70.90782 70.90782 71.15922 70.90782 70.90782 70.90782 70.90782 70.90782
## 57 58 59 60 61 62 63 64
## 70.90782 70.90782 70.90782 70.90782 70.90782 71.15922 70.90782 70.90782
## 65 66 67 68 69 70 71 72
## 70.90782 71.15922 71.15922 70.90782 70.90782 70.90782 71.15922 71.05866
## 73 74 75 76 77 78 79 80
## 71.15922 70.90782 70.90782 70.90782 70.90782 70.40503 70.40503 70.40503
## 81 82 83 84 85 86 87 88
## 70.40503 70.40503 70.40503 70.65643 70.40503 70.65643 70.40503 70.40503
## 89 90 91 92 93 94 95 96
## 70.65643 70.50559 70.40503 70.40503 70.40503 70.40503 70.40503 70.65643
## 97 98 99 100 101 102 103 104
## 70.40503 70.65643 70.40503 70.40503 70.65643 70.40503 70.40503 70.40503
## 105 106 107 108 109 110 111 112
## 70.40503 70.25419 70.15364 70.15364 69.90224 69.90224 69.90224 69.90224
## 113 114 115 116 117 118 119 120
## 70.15364 70.15364 69.90224 69.90224 69.90224 69.90224 69.90224 70.15364
## 121 122 123 124 125 126 127 128
## 69.90224 69.90224 69.90224 69.90224 70.15364 69.90224 70.00280 69.90224
## 129 130 131 132 133 134 135 136
## 70.25419 69.90224 69.90224 69.90224 69.90224 69.90224 70.15364 69.90224
## 137 138 139 140 141 142 143 144
## 69.39945 69.39945 69.39945 69.39945 69.39945 69.39945 69.39945 69.65085
## 145 146 147 148 149 150 151 152
## 69.39945 69.39945 69.39945 69.65085 69.39945 68.89666 68.89666 68.89666
## 153 154 155 156 157 158 159 160
## 68.89666 69.14806 68.89666 68.89666 69.14806 69.14806 68.89666 68.89666
## 161 162 163 164 165 166 167 168
## 68.89666 68.89666 68.39387 68.39387 68.39387 68.39387 68.39387 68.39387
## 169 170 171 172 173 174 175 176
## 68.39387 68.39387 68.39387 68.64526 68.64526 67.89108 67.89108 67.89108
## 177 178 179
## 67.89108 66.88550 67.13689
##
## $se.fit
## [1] 0.6362006 0.4499005 0.4197098 0.4197098 0.4197098 0.3606249 0.3606249
## [8] 0.3041064 0.3041064 0.3041064 0.3041064 0.3041064 0.2878792 0.2518878
## [15] 0.2518878 0.2518878 0.2518878 0.2518878 0.2518878 0.2773032 0.2518878
## [22] 0.2518878 0.2072453 0.2072453 0.2072453 0.2072453 0.2072453 0.2072453
## [29] 0.2374913 0.2072453 0.2283243 0.2283243 0.2072453 0.2374913 0.2072453
## [36] 0.2283243 0.2072453 0.2072453 0.2072453 0.2072453 0.2072453 0.2072453
## [43] 0.2072453 0.2072453 0.2072453 0.1760402 0.1760402 0.1760402 0.1760402
## [50] 0.1760402 0.1894818 0.1760402 0.1760402 0.1760402 0.1760402 0.1760402
## [57] 0.1760402 0.1760402 0.1760402 0.1760402 0.1760402 0.1894818 0.1760402
## [64] 0.1760402 0.1760402 0.1894818 0.1894818 0.1760402 0.1760402 0.1760402
## [71] 0.1894818 0.1835265 0.1894818 0.1760402 0.1760402 0.1760402 0.1760402
## [78] 0.1660303 0.1660303 0.1660303 0.1660303 0.1660303 0.1660303 0.1679615
## [85] 0.1660303 0.1679615 0.1660303 0.1660303 0.1679615 0.1660361 0.1660303
## [92] 0.1660303 0.1660303 0.1660303 0.1660303 0.1679615 0.1660303 0.1679615
## [99] 0.1660303 0.1660303 0.1679615 0.1660303 0.1660303 0.1660303 0.1660303
## [106] 0.1679388 0.1704558 0.1704558 0.1807716 0.1807716 0.1807716 0.1807716
## [113] 0.1704558 0.1704558 0.1807716 0.1807716 0.1807716 0.1807716 0.1807716
## [120] 0.1704558 0.1807716 0.1807716 0.1807716 0.1807716 0.1704558 0.1807716
## [127] 0.1759914 0.1807716 0.1679388 0.1807716 0.1807716 0.1807716 0.1807716
## [134] 0.1807716 0.1704558 0.1807716 0.2152371 0.2152371 0.2152371 0.2152371
## [141] 0.2152371 0.2152371 0.2152371 0.1960501 0.2152371 0.2152371 0.2152371
## [148] 0.1960501 0.2152371 0.2617481 0.2617481 0.2617481 0.2617481 0.2373868
## [155] 0.2617481 0.2617481 0.2373868 0.2373868 0.2617481 0.2617481 0.2617481
## [162] 0.2617481 0.3150135 0.3150135 0.3150135 0.3150135 0.3150135 0.3150135
## [169] 0.3150135 0.3150135 0.3150135 0.2877599 0.2877599 0.3721442 0.3721442
## [176] 0.3721442 0.3721442 0.4925607 0.4619348
##
## $df
## [1] 177
##
## $residual.scale
## [1] 2.219652
names(Y_hat)
## [1] "fit" "se.fit" "df" "residual.scale"
# plot best fit line
galton_heights %>%
mutate(Y_hat = predict(lm(son ~ father, data=.))) %>%
ggplot(aes(father, Y_hat))+
geom_line()
galton_heights %>%
ggplot(aes(son, father)) +
geom_point() +
geom_smooth(formula= y ~ x, method="lm")
galton_heights %>%
mutate(Y_hat = predict(lm(formula = son ~ father, data = .))) %>%
ggplot(aes(father, Y_hat)) +
geom_line()
galton_heights_hat <- galton_heights %>%
mutate(Y_hat = predict(lm(son ~ father, data=.)))
head(galton_heights_hat)
## father son Y_hat
## 1 78.5 73.2 75.18154
## 2 75.5 73.5 73.67317
## 3 75.0 71.0 73.42177
## 4 75.0 70.5 73.42177
## 5 75.0 72.0 73.42177
## 6 74.0 76.5 72.91898
fit <- galton_heights %>%
lm(formula = son ~ father, data = .)
Y_hat <- predict(fit, se.fit = TRUE)
names(Y_hat)
## [1] "fit" "se.fit" "df" "residual.scale"
[][Why do we construct confidence intervals? Why are confidence
intervals important? Because confidence intervals represent the range of
scores that are likely if we were to repeat the survey, they are
important to consider when generalizing results.] https://www.google.com/search?client=firefox-b-e&q=construct+confidence+intervals
Go read the book
Comprehension Check due Jun 12, 2022 00:29 AWST # Question 1 1/1 point (graded)
The following code was used in the video to plot RSS with Beta_0 = 25.
beta1 = seq(0, 1, len=nrow(galton_heights))
results <- data.frame(beta1 = beta1,
rss = sapply(beta1, rss, beta0 = 25))
results %>%
ggplot(aes(beta1, rss)) +
geom_line() +
geom_line(aes(beta1, rss), col=2)
In a model for sons’ heights vs fathers’ heights, what is the least squares estimate (LSE) for Beta_1 if we assume Beta_0 hat is 36?
Hint: modify the code above to do your analysis. 0.65 0.5 0.2 12 correct Answer Correct: Correct. You can tell from a plot of RSS vs that the minimum estimate is 0.5
Explanation
Using the code from the video, you can plot RSS vs to find the value for that minimizes the RSS. In this case, that value is 0.5 when we assume that
is 36.
When we assumed that was 25, as in the sample code, the LSE for was 0.65.
[][Need to understand the below R code, how the blocks are put together and so on]
beta1 = seq(0, 1, len=nrow(galton_heights))
results <- data.frame(beta1 = beta1,
rss = sapply(beta1, rss, beta0 = 36))
results %>%
ggplot(aes(beta1, rss)) +
geom_line(aes(beta1, rss), col=2)
[][Residual sum of squares (RSS) measures the distance between the
true value and the predicted value given by the regression line. The
values that minimize the RSS are called the least squares estimates
(LSE)]
1/1 point (graded)
The least squares estimates for the parameters Beta_0, Beta_1, Beta_2, … Beta_n minimize the Residual Sum of Squares
correct
the residual sum of squares. You have used 1 of 1 attempt Some
1/1 point (graded)
Load the Lahman library and filter the Teams data frame to the years 1961-2001. Run a linear model in R predicting the number of runs per game based on both the number of bases on balls per game and the number of home runs per game. What is the coefficient for bases on balls? 0.39 1.56 1.74 0.027 correct Answer Correct: Correct.
Explanation
The coefficient for bases on balls is 0.39; the coefficient for home runs is 1.56; the intercept is 1.74; the standard error for the BB coefficient is 0.027.
library(broom) # tidy() function
Teams_small <- Teams %>%
filter(yearID %in% 1961:2001)
Teams_small %>%
mutate(R_per_game = R/G, BB_per_game = BB/G, HR_per_game = HR/G) %>%
do(tidy(lm(R_per_game ~ BB_per_game + HR_per_game, data = .)))
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.74 0.0824 21.2 7.62e- 83
## 2 BB_per_game 0.387 0.0270 14.3 1.20e- 42
## 3 HR_per_game 1.56 0.0490 31.9 1.78e-155
Teams_small <- Teams %>%
filter(yearID %in% 1961:2001)
Teams_small %>%
mutate(R_per_game = R/G, BB_per_game = BB/G, HR_per_game = HR/G) %>%
lm(R_per_game ~ BB_per_game + HR_per_game, data = .) %>%
#coef
summary # Check here, why its different with/without %>% .$coef after summary
##
## Call:
## lm(formula = R_per_game ~ BB_per_game + HR_per_game, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.87325 -0.24507 -0.01449 0.23866 1.24218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.74430 0.08236 21.18 <2e-16 ***
## BB_per_game 0.38742 0.02701 14.34 <2e-16 ***
## HR_per_game 1.56117 0.04896 31.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3484 on 1023 degrees of freedom
## Multiple R-squared: 0.6503, Adjusted R-squared: 0.6496
## F-statistic: 951.2 on 2 and 1023 DF, p-value: < 2.2e-16
Teams_small <- Teams %>%
filter(yearID %in% 1961:2001)
Teams_small %>%
mutate(R_per_game = R/G, BB_per_game = BB/G, HR_per_game = HR/G) %>%
lm(R_per_game ~ BB_per_game + HR_per_game, data = .) %>%
summary %>%
.$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7443011 0.08235597 21.18002 7.617810e-83
## BB_per_game 0.3874238 0.02701124 14.34306 1.198143e-42
## HR_per_game 1.5611689 0.04896000 31.88662 1.777062e-155
You have used 1 of 2 attempts Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button. Answers are displayed within the problem
0.5/1 point (graded)
We run a Monte Carlo simulation where we repeatedly take samples of N = 100 from the Galton heights data and compute the regression slope coefficients for each sample:
B <- 1000
N <- 100
lse <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) %>%
lm(son ~ father, data = .) %>%
.$coef
})
lse <- data.frame(beta_0 = lse[1,], beta_1 = lse[2,])
What does the central limit theorem tell us about the variables beta_0 and beta_1?
Select ALL that apply. They are approximately normally
distributed. correct The expected value of each is the true
value of Beta_0 and Beta_1 (assuming the Galton heights data is a
complete population). Should think about this one
correct The central limit theorem does not apply in this
situation.** [][Wrong Choice, restructure your mind] It allows us to
test the hypothesis that Beta_0 = 0 and Beta_1 = 0 . partially correct
Answer Incorrect: Correct. With a large enough N
(So how much fit this large enough ? Or should I plot the histogram every time to see if its a bell curve),
the distributions of both beta_0 and beta_1 are approximately normal.
Try again. For large enough N, the central limit theorem does apply.
Explanation #
================================================================================================================================
With a large enough N, the central limit theorem applies and tells us
that the distributions of both beta_0 and beta_1 are approximately
normal. [][The expected (Expected) values of beta_0 and
beta_1 are the true values of Beta_0 and Beta_1, assuming that the
Galton heights data are a complete population.]
For hypothesis testing, we assume that the errors in the model are normally distributed. You have used 2 of 2 attempts
1/1 point (graded) Which R code(s) below would properly plot the predictions and confidence intervals for our linear model of sons’ heights?
NOTE: The function as.tibble() has been replaced by as_tibble() in a recent dplyr update.
Select ALL that apply.
galton_heights %>%
ggplot(aes(father, son)) +
geom_point(alpha = 0.3) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# `geom_smooth()` using method = 'loess' and formula 'y ~ x' by default,
# scatterplot for each HR stratum
#dat %>%
# ggplot(aes(BB_per_game, R_per_game)) +
# geom_point(alpha = 0.5) +
# geom_smooth(method = "lm") +
# facet_wrap( ~ HR_strata)
data("GaltonFamilies")
galton_heights <- GaltonFamilies %>%
filter(gender == "male") %>%
select(father, childHeight) %>%
rename(son = childHeight)
galton_heights %>%
ggplot(aes(father, son)) +
geom_point() +
geom_smooth(formula= y ~ x, method = "lm")
model <- lm(son ~ father, data = galton_heights)
predictions <- predict(model, interval = c("confidence"), level = 0.95)
data <- as_tibble(predictions) %>%
bind_cols(father = galton_heights$father)
ggplot(data, aes(x = father, y = fit)) +
geom_line(color = "blue", size = 1) +
geom_ribbon(aes(ymin=lwr, ymax=upr), alpha=0.2) +
geom_point(data = galton_heights, aes(x = father, y = son))
# Not WOrking R Code Block
model <- lm(son ~ father, data = galton_heights)
predictions <- predict(model, data = galton_heights)
data <- as_tibble(predictions) %>%
bind_cols(father = galton_heights$father)
ggplot(data, aes(x = father, y = fit)) +
geom_line(color = "blue", size = 1) +
geom_point(data = galton_heights, aes(x = father, y = son))
Answer Correct: Correct. This is one way to plot predictions and confidence intervals for a linear model of sons’ heights vs. fathers’ heights. This is one of two correct answers. Correct. This code uses the predict command to generate predictions and 95% confidence intervals for the linear model of sons’ heights vs. fathers’ heights. This is one of two correct answers.
[][Explanation]
If using the geom_smooth command, you need to specify that method = “lm” in your geom_smooth command, otherwise the smooth line is a loess smooth and not a linear model.
If using the predict command, you need to include the confidence intervals on your figure by first specifying that you want confidence intervals in the predict command, and then adding them to your figure as a geom_ribbon. You have used 1 of 2 attempts Some
2.0/2.0 points (graded)
Fit a linear regression model predicting the mothers’ heights using daughters’ heights. What is the slope of the model? correct
0.31 Loading
Explanation
The following code can be used to determine the slope:
fit <- lm(mother ~ daughter, data = female_heights)
fit$coef[2]
What the intercept of the model? correct
44.2 Loading
Explanation
The following code can be used to determine the intercept:
fit$coef[1]
[][****Think about this code, how to manage put them block by block and doing what we need****]
#set.seed(1989) #if you are using R 3.5 or earlier
set.seed(1989, sample.kind="Rounding") #if you are using R 3.6 or later
## Warning in set.seed(1989, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
data("GaltonFamilies")
options(digits = 3) # report 3 significant digits
female_heights <- GaltonFamilies %>%
filter(gender == "female") %>%
group_by(family) %>%
sample_n(1) %>%
ungroup() %>% # Why we need the ungroup() here ??? I commented this code and below mutate() function failed
select(mother, childHeight) %>%
rename(daughter = childHeight)
female_heights %>%
#do(tidy(lm(mother ~ daughter, data = .))) # same as below code pipes
lm(mother ~ daughter, data = .) %>%
summary %>% .$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.18 4.4105 10.02 6.20e-19
## daughter 0.31 0.0686 4.53 1.11e-05
fit <- lm(formula = mother ~ daughter, data = female_heights)
fit$coefficients
## (Intercept) daughter
## 44.18 0.31
female_heights_hat <- female_heights %>%
mutate(Y_hat = predict(lm(mother ~ daughter, data = .)))
head(female_heights_hat)
## # A tibble: 6 × 3
## mother daughter Y_hat
## <dbl> <dbl> <dbl>
## 1 67 69 65.6
## 2 66.5 65.5 64.5
## 3 64 68 65.3
## 4 64 64.5 64.2
## 5 58.5 66.5 64.8
## 6 68 69.5 65.7
2.0/2.0 points (graded)
Predict mothers’ heights using the model. What is the predicted height of the first mother in the dataset? correct
65.6 Loading
Explanation
The following code can be used to determine the height:
predict(fit)[1]
What is the actual height of the first mother in the dataset? correct
67 Loading
Explanation
The following code can be used to determine the height:
female_heights$mother[1]
#set.seed(1989) #if you are using R 3.5 or earlier
set.seed(1989, sample.kind="Rounding") #if you are using R 3.6 or later
## Warning in set.seed(1989, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
data("GaltonFamilies")
options(digits = 3) # report 3 significant digits
female_heights <- GaltonFamilies %>%
filter(gender == "female") %>%
group_by(family) %>%
sample_n(1) %>%
ungroup() %>%
select(mother, childHeight) %>%
rename(daughter = childHeight)
# Solution Here #########################################################################################
fit <- lm(mother ~ daughter, data = female_heights)
fit$coefficients
## (Intercept) daughter
## 44.18 0.31
predict(fit)[1]
## 1
## 65.6
female_heights$mother[1]
## [1] 67
We have shown how BB and singles have similar predictive power for scoring runs. Another way to compare the usefulness of these baseball metrics is by assessing how stable they are across the years. Because we have to pick players based on their previous performances, we will prefer metrics that are more stable. In these exercises, we will compare the stability of singles and BBs.
Before we get started, we want to generate two tables: one for 2002 and another for the average of 1999-2001 seasons. We want to define per plate appearance statistics, keeping only players with more than 100 plate appearances. Here is how we create the 2002 table:
library(Lahman)
bat_02 <- Batting %>% filter(yearID == 2002) %>%
mutate(pa = AB + BB, singles = (H - X2B - X3B - HR)/pa, bb = BB/pa) %>%
filter(pa >= 100) %>%
select(playerID, singles, bb)
2.0/2.0 points (graded)
Now compute a similar table but with rates computed over 1999-2001. Keep only rows from 1999-2001 where players have 100 or more plate appearances, calculate each player’s single rate and BB rate per stint (where each row is one stint - a player can have multiple stints within a season), then calculate the average single rate (mean_singles) and average BB rate (mean_bb) per player over the three year period. How many players had a single rate mean_singles of greater than 0.2 per plate appearance over 1999-2001? correct
46 Loading
The following code can be used to determine the number of players:
bat_99_01 <- Batting %>%
filter(yearID %in% 1999:2001) %>%
mutate(pa = AB + BB, singles = (H - X2B - X3B - HR)/pa, bb = BB/pa) %>%
filter(pa >= 100) %>%
group_by(playerID) %>%
summarize(mean_singles = mean(singles), mean_bb = mean(bb))
sum(bat_99_01$mean_singles > 0.2)
How many players had a BB rate mean_bb of greater than 0.2 per plate appearance over 1999-2001? correct
3 Loading
The following code can be used to determine the number of players:
sum(bat_99_01$mean_bb > 0.2)
library(Lahman)
bat_02 <- Batting %>%
filter(yearID == 2002) %>%
mutate(pa = AB + BB, singles = (H - X2B - X3B - HR)/pa, bb = BB/pa) %>%
filter(pa >= 100) %>%
select(playerID, singles, bb)
bat_01 <- Batting %>%
filter(yearID %in% 1999:2001) %>%
mutate(pa = AB + BB, singles = (H - X2B - X3B - HR)/pa, bb = BB/pa) %>%
filter(pa >= 100) %>%
select(playerID, singles, bb)
length(bat_01[[1]])
## [1] 1368
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
bat_01_n1 <- bat_01 %>%
group_by(playerID) %>%
summarise_at(vars(singles, bb), list(name = mean)) %>%
filter(singles_name > 0.2)
# https://www.statology.org/r-mean-by-group/
bat_01_n1
## # A tibble: 46 × 3
## playerID singles_name bb_name
## <chr> <dbl> <dbl>
## 1 bautida01 0.200 0.0498
## 2 boggswa01 0.215 0.115
## 3 brownad01 0.203 0.107
## 4 bushho01 0.218 0.0424
## 5 cairomi01 0.211 0.0787
## 6 carusmi01 0.209 0.0364
## 7 castilu01 0.220 0.119
## 8 cedenro01 0.211 0.108
## 9 cirilje01 0.205 0.0953
## 10 delluda01 0.221 0.0919
## # … with 36 more rows
length(bat_01_n1[[1]])
## [1] 46
bat_01_n2 <- bat_01 %>%
group_by(playerID) %>%
summarise_at(vars(singles, bb), list(name = mean)) %>%
filter(bb_name > 0.2)
# https://www.statology.org/r-mean-by-group/
# I suppose Google worth half my brain, or my brain is not functional as it should thus need Google
bat_01_n2
## # A tibble: 3 × 3
## playerID singles_name bb_name
## <chr> <dbl> <dbl>
## 1 bondsba01 0.0907 0.213
## 2 jahajo01 0.119 0.217
## 3 mcgwima01 0.0853 0.202
length(bat_01_n2[[1]])
## [1] 3
2.0/2.0 points (graded)
Use inner_join() to combine the bat_02 table with the table of 1999-2001 rate averages you created in the previous question. What is the correlation between 2002 singles rates and 1999-2001 average singles rates? correct
0.551 Loading
The following code can be used to determine the correlation:
dat <- inner_join(bat_02, bat_99_01)w
cor(dat$singles, dat$mean_singles)
What is the correlation between 2002 BB rates and 1999-2001 average BB rates? correct
0.717 Loading
The following code can be used to determine the correlation:
cor(dat$bb, dat$mean_bb)
bat_01_avg <- bat_01 %>%
group_by(playerID) %>%
summarise_at(vars(singles, bb), list(name = mean)) %>%
ungroup()
head(inner_join(bat_02, bat_01_avg, by = 'playerID'))
## playerID singles bb singles_name bb_name
## 1 abernbr01 0.180 0.0512 0.178 0.0816
## 2 abreubo01 0.148 0.1538 0.153 0.1557
## 3 agbaybe01 0.118 0.0787 0.162 0.1153
## 4 alfoned01 0.197 0.1123 0.161 0.1228
## 5 alicelu01 0.160 0.1190 0.168 0.1001
## 6 alomaro01 0.182 0.0881 0.186 0.1222
length(inner_join(bat_02, bat_01_avg, by = 'playerID')[[1]])
## [1] 392
inner_join(bat_02, bat_01_avg, by = 'playerID') %>%
summarise(cor(singles, singles_name))
## cor(singles, singles_name)
## 1 0.551
inner_join(bat_02, bat_01_avg, by = 'playerID') %>%
summarise(cor(bb, bb_name))
## cor(bb, bb_name)
## 1 0.717
1/1 point (graded)
Make scatterplots of mean_singles versus singles and mean_bb versus bb. Are either of these distributions bivariate normal? Neither distribution is bivariate normal. singles and mean_singles are bivariate normal, but bb and mean_bb are not. bb and mean_bb are bivariate normal, but singles and mean_singles are not. Both distributions are bivariate normal. correct
Both distributions are bivariate normal, as can be seen in the scatter plots made using the following code:
dat %>%
ggplot(aes(singles, mean_singles)) +
geom_point()
dat %>%
ggplot(aes(bb, mean_bb)) +
geom_point()
inner_join(bat_02, bat_01_avg, by = 'playerID') %>%
ggplot(aes(singles, singles_name)) +
geom_point(alpha = 0.3)
inner_join(bat_02, bat_01_avg, by = 'playerID') %>%
ggplot(aes(bb, bb_name)) +
geom_point(alpha = 0.3)
2/2 points (graded)
Fit a linear model to predict 2002 singles given 1999-2001 mean_singles. What is the coefficient of mean_singles, the slope of the fit? correct
0.588 Loading
The linear model and slope can be generated using the following code:
fit_singles <- lm(singles ~ mean_singles, data = dat)
fit_singles$coef[2]
Fit a linear model to predict 2002 bb given 1999-2001 mean_bb. What is the coefficient of mean_bb, the slope of the fit? correct
0.829 Loading
fit_bb <- lm(bb ~ mean_bb, data = dat)
fit_bb$coef[2]
head(inner_join(bat_02, bat_01_avg, by = 'playerID'))
## playerID singles bb singles_name bb_name
## 1 abernbr01 0.180 0.0512 0.178 0.0816
## 2 abreubo01 0.148 0.1538 0.153 0.1557
## 3 agbaybe01 0.118 0.0787 0.162 0.1153
## 4 alfoned01 0.197 0.1123 0.161 0.1228
## 5 alicelu01 0.160 0.1190 0.168 0.1001
## 6 alomaro01 0.182 0.0881 0.186 0.1222
inner_join(bat_02, bat_01_avg, by = 'playerID') %>%
lm(singles ~ singles_name, data = .) %>%
summary
##
## Call:
## lm(formula = singles ~ singles_name, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.08380 -0.01673 -0.00108 0.01666 0.06894
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.06206 0.00742 8.37 1.1e-15 ***
## singles_name 0.58813 0.04511 13.04 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0247 on 390 degrees of freedom
## Multiple R-squared: 0.304, Adjusted R-squared: 0.302
## F-statistic: 170 on 1 and 390 DF, p-value: <2e-16
inner_join(bat_02, bat_01_avg, by = 'playerID') %>%
lm(bb ~ bb_name, data = .) %>%
summary
##
## Call:
## lm(formula = bb ~ bb_name, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.06738 -0.01695 -0.00136 0.01527 0.13777
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01548 0.00391 3.96 9e-05 ***
## bb_name 0.82905 0.04076 20.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0265 on 390 degrees of freedom
## Multiple R-squared: 0.515, Adjusted R-squared: 0.514
## F-statistic: 414 on 1 and 390 DF, p-value: <2e-16
Let’s go back to baseball. In a previous example, we estimated the regression lines to predict runs from bases and balls in different home run strata. We first constructed a data frame similar to this. Then, to compute the regression line in each strata, since we didn’t know the lm function back then, we used the formula directly like this. We argued that the slopes are similar and that the differences were perhaps due to random variation. To provide a more rigorous defense of the slopes being the same, which is what led to our multivariate regression model, [][we could compute confidence intervals for each slope]. We have not learned the formula for this, but the lm function provides enough information to construct them.
First, note that if we try to use the lm function to get the
estimated slope like this, we don’t get what we want. The lm function
ignored the group_by. This is expected, because lm is not part of the
tidyverse and does not know how to handle the outcome of group_by which
is a group tibble.
([][See the outcome list we receive by summarize() function and the .$coef after the lm ??])
We’re going to describe tibbles in some details now. When summarize
receives the output of group_by, it somehow knows which rows of the
table go with which groups. But where is this information stored in the
data frame? Let’s write some code to see the output of a group_by call.
Note that there are no columns with the information needed to define the
groups. But if you look closely at the output, you notice the line “A
tibble, 6 by 3.” We can learn the class of the return object using this
line of code, and we see that the class is a “tbl.” This is pronounced
“tibble.” It is also a tbl_df. This is equivalent to tibble. The
tibble is a special kind of data frame (what makes it special
??). We have seen them before, because [][tidyverse functions such
as group_by and also summarize always return this type of data
frame]. The group_by function returns a special kind of
tibble, the grouped tibble. We will say more about the grouped
tibbles later. Note that the manipulation verbs, [][select, filter,
mutate, and arrange, don’t necessarily return tibbles. They preserve the
class of the input]. If they receive a regular data frame, they
return a regular data frame. If they receive a tibble, they return a
tibble. But []tibbles are the default data frame for the
tidyverse]. Tibbles are very similar to data frames. You
can think of them as modern versions of data frames. Next, we’re going
to describe briefly three important differences.
[][Textbook link]
This video corresponds to the textbook section discussing tibbles. https://rafalab.github.io/dsbook/linear-models.html#linear-regression-in-the-tidyverse
[][Key points]
Tibbles can be regarded as a modern version of data frames and are the default data structure in the tidyverse.
Some functions that do not work properly with data frames do work with tibbles.
Code
# stratify by HR
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR = round(HR/G, 1),
BB = BB/G,
R = R/G) %>%
select(HR, BB, R) %>%
filter(HR >= 0.4 & HR<=1.2)
# calculate slope of regression lines to predict runs by BB in different HR strata
dat %>%
group_by(HR) %>%
summarize(slope = cor(BB,R)*sd(R)/sd(BB))
# use lm to get estimated slopes - lm does not work with grouped tibbles
dat %>%
group_by(HR) %>%
lm(R ~ BB, data = .) %>%
.$coef
# inspect a grouped tibble
dat %>% group_by(HR) %>% head()
dat %>% group_by(HR) %>% class()
library(tidyverse)
library(Lahman)
# stratify HR per game to nearest 10, filter out strata with few points
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR_strata = round(HR/G, 1),
BB_per_game = BB / G,
R_per_game = R / G) %>%
select(HR_strata, BB_per_game, R_per_game) %>% # Why doing select() here?
filter(HR_strata >= 0.4 & HR_strata <=1.2)
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# scatterplot for each HR stratum
dat %>%
group_by(HR_strata) %>%
summary(slope = cor(BB_per_game, R_per_game)*sd(R_per_game)/sd(BB_per_game)) # This is my code, see the difference ??
## HR_strata BB_per_game R_per_game
## Min. :0.400 Min. :2.13 Min. :2.86
## 1st Qu.:0.700 1st Qu.:2.97 1st Qu.:3.93
## Median :0.800 Median :3.23 Median :4.27
## Mean :0.818 Mean :3.27 Mean :4.28
## 3rd Qu.:1.000 3rd Qu.:3.56 3rd Qu.:4.63
## Max. :1.200 Max. :4.78 Max. :5.98
#https://statisticsglobe.com/summary-function-in-r/ like "df.describe()" in Python's Panda DataFrame
# calculate slope of regression line after stratifying by HR
dat %>%
group_by(HR_strata) %>%
summarize(slope = cor(BB_per_game, R_per_game)*sd(R_per_game)/sd(BB_per_game))
## # A tibble: 9 × 2
## HR_strata slope
## <dbl> <dbl>
## 1 0.4 0.734
## 2 0.5 0.566
## 3 0.6 0.412
## 4 0.7 0.285
## 5 0.8 0.365
## 6 0.9 0.261
## 7 1 0.512
## 8 1.1 0.454
## 9 1.2 0.440
library(broom) # tidy() function
Teams_small <- Teams %>%
filter(yearID %in% 1961:2001)
Teams_small %>%
mutate(R_per_game = R/G, BB_per_game = BB/G, HR_per_game = HR/G) %>%
do(tidy(lm(R_per_game ~ BB_per_game + HR_per_game, data = .)))
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.74 0.0824 21.2 7.62e- 83
## 2 BB_per_game 0.387 0.0270 14.3 1.20e- 42
## 3 HR_per_game 1.56 0.0490 31.9 1.78e-155
dat %>%
group_by(HR_strata) %>%
lm(R_per_game ~ BB_per_game, data = .) %>%
.$coef
## (Intercept) BB_per_game
## 2.198 0.638
dat %>%
group_by(HR_strata) %>%
lm(R_per_game ~ BB_per_game, data = .) %>%
#.$coef
summary
##
## Call:
## lm(formula = R_per_game ~ BB_per_game, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2883 -0.3195 0.0113 0.3388 1.4105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1984 0.1135 19.4 <2e-16 ***
## BB_per_game 0.6379 0.0344 18.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.454 on 947 degrees of freedom
## Multiple R-squared: 0.266, Adjusted R-squared: 0.265
## F-statistic: 343 on 1 and 947 DF, p-value: <2e-16
first note if we try to use lm function to get the estimates we dont get what we want
dat %>%
group_by(HR_strata) %>%
head()
## # A tibble: 6 × 3
## # Groups: HR_strata [5]
## HR_strata BB_per_game R_per_game
## <dbl> <dbl> <dbl>
## 1 0.9 3.56 4.24
## 2 0.7 3.97 4.47
## 3 0.8 3.37 4.69
## 4 1.1 3.46 4.42
## 5 1 2.75 4.61
## 6 0.9 3.06 4.58
dat %>%
group_by(HR_strata) %>%
class()
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
Never imaged study the Data Structure in R
In this video, we’re going to describe some of the [][main differences between tibbles and data frames]. First, the print method for tibble is much more readable than that of a data frame. To see this, type teams on your console after loading the Baseball Lahman Database. And you will see a very, very long list of columns and rows. It’s barely readable. Teams is a data frame with many rows and columns. That’s why you see that. Nevertheless, the output just shows everything wraps around and is hard to read. It is so bad that we don’t even print it here. We’ll let you print it on your own screen. Now if you convert this data frame to a tibble data frame, the output is much more readable. Here’s an example. That’s the first main difference between tibbles and data frames.
A second one is that if you subset the columns of a data frame, you may get back an object that is not a data frame. With tibbles, this is not the case. Here’s an example. If we subset a tibble, we get back a tibble. This is useful in the tidyverse since functions require data frames as input. Now whenever you want to access the original vector that defines a column in a table, for this, you actually have to use the accessor dollar sign. A related feature to this is that tibbles will give you a warning if you try to access a column that does not exist. That’s not the case for data frames. For example, if we accidentally write hr in lowercase instead of uppercase, with a data frame, all we get is a no. No warning. This can make it quite hard to debug code. In contrast, if it’s a tibble, and you try to access the lowercase hr column, which doesn’t exist, you actually get a warning.
A third difference is that while columns of a data frame need to be a vector of number, strings or Boolean, tibbles can have more complex objects, such as lists or functions. Also note that we can create tibbles with the tibble function. So, look at this line of code. We’re creating a column that actually has functions in it. You can see the output here.
Finally, the last difference we describe is that tibbles can be grouped. The function group by returns a special kind of tibble, a grouped tibble. This class stores information that lets you know which rows are in which groups. The tidyverse functions, in particular [][the summarize functions, are aware of the group information]. In the example we showed, we saw that the lm function, which is not part of the tidyverse, does not know how to deal with group tibbles. The object is basically converted to a regular data frame, and then the function runs ignoring the groups. This is why we only get one pair of estimates, as we see here. To make these non-tidyverse function better integrate with a tidyverse, we will learn a new function, the function do.
[][Key points] Tibbles are more readable than data frames. If you subset a data frame, you may not get a data frame. If you subset a tibble, you always get a tibble. Tibbles can hold more complex objects such as lists or functions. Tibbles can be grouped.
[][Code] # inspect data frame and tibble Teams as_tibble(Teams) # Note that the function was formerly called as.tibble()
class(Teams[,20])
class(as_tibble(Teams[,20]))
class(as_tibble(Teams)$HR)
Teams\(hr as_tibble(Teams)\)HR
tibble(id = c(1, 2, 3), func = c(mean, median, sd))
head(Teams)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1871 NA BS1 BNA <NA> 3 31 NA 20 10 <NA> <NA> N
## 2 1871 NA CH1 CNA <NA> 2 28 NA 19 9 <NA> <NA> N
## 3 1871 NA CL1 CFC <NA> 8 29 NA 10 19 <NA> <NA> N
## 4 1871 NA FW1 KEK <NA> 7 19 NA 7 12 <NA> <NA> N
## 5 1871 NA NY2 NNA <NA> 5 33 NA 16 17 <NA> <NA> N
## 6 1871 NA PH1 PNA <NA> 1 28 NA 21 7 <NA> <NA> Y
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 <NA> 401 1372 426 70 37 3 60 19 73 16 NA NA 303 109 3.55 22 1 3
## 2 <NA> 302 1196 323 52 21 10 60 22 69 21 NA NA 241 77 2.76 25 0 1
## 3 <NA> 249 1186 328 35 40 7 26 25 18 8 NA NA 341 116 4.11 23 0 0
## 4 <NA> 137 746 178 19 8 2 33 9 16 4 NA NA 243 97 5.17 19 1 0
## 5 <NA> 302 1404 403 43 21 1 33 15 46 15 NA NA 313 121 3.72 32 1 0
## 6 <NA> 376 1281 410 66 27 9 46 23 56 12 NA NA 266 137 4.95 27 0 0
## IPouts HA HRA BBA SOA E DP FP name
## 1 828 367 2 42 23 243 24 0.834 Boston Red Stockings
## 2 753 308 6 28 22 229 16 0.829 Chicago White Stockings
## 3 762 346 13 53 34 234 15 0.818 Cleveland Forest Citys
## 4 507 261 5 21 17 163 8 0.803 Fort Wayne Kekiongas
## 5 879 373 7 42 22 235 14 0.840 New York Mutuals
## 6 747 329 3 53 16 194 13 0.845 Philadelphia Athletics
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 South End Grounds I NA 103 98 BOS BS1
## 2 Union Base-Ball Grounds NA 104 102 CHI CH1
## 3 National Association Grounds NA 96 100 CLE CL1
## 4 Hamilton Field NA 101 107 KEK FW1
## 5 Union Grounds (Brooklyn) NA 90 88 NYU NY2
## 6 Jefferson Street Grounds NA 102 98 ATH PH1
## teamIDretro
## 1 BS1
## 2 CH1
## 3 CL1
## 4 FW1
## 5 NY2
## 6 PH1
tibble(Teams)
## # A tibble: 2,985 × 48
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin
## <int> <fct> <fct> <fct> <chr> <int> <int> <int> <int> <int> <chr> <chr>
## 1 1871 NA BS1 BNA <NA> 3 31 NA 20 10 <NA> <NA>
## 2 1871 NA CH1 CNA <NA> 2 28 NA 19 9 <NA> <NA>
## 3 1871 NA CL1 CFC <NA> 8 29 NA 10 19 <NA> <NA>
## 4 1871 NA FW1 KEK <NA> 7 19 NA 7 12 <NA> <NA>
## 5 1871 NA NY2 NNA <NA> 5 33 NA 16 17 <NA> <NA>
## 6 1871 NA PH1 PNA <NA> 1 28 NA 21 7 <NA> <NA>
## 7 1871 NA RC1 ROK <NA> 9 25 NA 4 21 <NA> <NA>
## 8 1871 NA TRO TRO <NA> 6 29 NA 13 15 <NA> <NA>
## 9 1871 NA WS3 OLY <NA> 4 32 NA 15 15 <NA> <NA>
## 10 1872 NA BL1 BLC <NA> 2 58 NA 35 19 <NA> <NA>
## # … with 2,975 more rows, and 36 more variables: LgWin <chr>, WSWin <chr>,
## # R <int>, AB <int>, H <int>, X2B <int>, X3B <int>, HR <int>, BB <int>,
## # SO <int>, SB <int>, CS <int>, HBP <int>, SF <int>, RA <int>, ER <int>,
## # ERA <dbl>, CG <int>, SHO <int>, SV <int>, IPouts <int>, HA <int>,
## # HRA <int>, BBA <int>, SOA <int>, E <int>, DP <int>, FP <dbl>, name <chr>,
## # park <chr>, attendance <int>, BPF <int>, PPF <int>, teamIDBR <chr>,
## # teamIDlahman45 <chr>, teamIDretro <chr>
as_tibble(Teams)
## # A tibble: 2,985 × 48
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin
## <int> <fct> <fct> <fct> <chr> <int> <int> <int> <int> <int> <chr> <chr>
## 1 1871 NA BS1 BNA <NA> 3 31 NA 20 10 <NA> <NA>
## 2 1871 NA CH1 CNA <NA> 2 28 NA 19 9 <NA> <NA>
## 3 1871 NA CL1 CFC <NA> 8 29 NA 10 19 <NA> <NA>
## 4 1871 NA FW1 KEK <NA> 7 19 NA 7 12 <NA> <NA>
## 5 1871 NA NY2 NNA <NA> 5 33 NA 16 17 <NA> <NA>
## 6 1871 NA PH1 PNA <NA> 1 28 NA 21 7 <NA> <NA>
## 7 1871 NA RC1 ROK <NA> 9 25 NA 4 21 <NA> <NA>
## 8 1871 NA TRO TRO <NA> 6 29 NA 13 15 <NA> <NA>
## 9 1871 NA WS3 OLY <NA> 4 32 NA 15 15 <NA> <NA>
## 10 1872 NA BL1 BLC <NA> 2 58 NA 35 19 <NA> <NA>
## # … with 2,975 more rows, and 36 more variables: LgWin <chr>, WSWin <chr>,
## # R <int>, AB <int>, H <int>, X2B <int>, X3B <int>, HR <int>, BB <int>,
## # SO <int>, SB <int>, CS <int>, HBP <int>, SF <int>, RA <int>, ER <int>,
## # ERA <dbl>, CG <int>, SHO <int>, SV <int>, IPouts <int>, HA <int>,
## # HRA <int>, BBA <int>, SOA <int>, E <int>, DP <int>, FP <dbl>, name <chr>,
## # park <chr>, attendance <int>, BPF <int>, PPF <int>, teamIDBR <chr>,
## # teamIDlahman45 <chr>, teamIDretro <chr>
Teams$yearID %>% class()
## [1] "integer"
Teams[, 20] %>% class() ###########################################################################################################
## [1] "integer"
tibble(Teams$yearID) %>% class()
## [1] "tbl_df" "tbl" "data.frame"
tibble(Teams)$yearID %>% class() # tibble doesn't offer you suggesting after $ sign
## [1] "integer"
tibble(Teams)$hello %>% class()
## Warning: Unknown or uninitialised column: `hello`.
## [1] "NULL"
library(tidyverse)
library(Lahman)
# stratify HR per game to nearest 10, filter out strata with few points
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR_strata = round(HR/G, 1),
BB_per_game = BB / G,
R_per_game = R / G) %>%
select(HR_strata, BB_per_game, R_per_game) %>% # Why doing select() here?
filter(HR_strata >= 0.4 & HR_strata <=1.2)
summary(lm(R_per_game ~ BB_per_game, data = dat))
##
## Call:
## lm(formula = R_per_game ~ BB_per_game, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2883 -0.3195 0.0113 0.3388 1.4105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1984 0.1135 19.4 <2e-16 ***
## BB_per_game 0.6379 0.0344 18.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.454 on 947 degrees of freedom
## Multiple R-squared: 0.266, Adjusted R-squared: 0.265
## F-statistic: 343 on 1 and 947 DF, p-value: <2e-16
dat %>%
group_by(HR_strata) %>%
do(aa = lm(R_per_game ~ BB_per_game, data = .)) %>% # Think, Think, Think, Think, Think, Think
.$aa
## [[1]]
##
## Call:
## lm(formula = R_per_game ~ BB_per_game, data = .)
##
## Coefficients:
## (Intercept) BB_per_game
## 1.359 0.734
##
##
## [[2]]
##
## Call:
## lm(formula = R_per_game ~ BB_per_game, data = .)
##
## Coefficients:
## (Intercept) BB_per_game
## 2.007 0.566
##
##
## [[3]]
##
## Call:
## lm(formula = R_per_game ~ BB_per_game, data = .)
##
## Coefficients:
## (Intercept) BB_per_game
## 2.533 0.412
##
##
## [[4]]
##
## Call:
## lm(formula = R_per_game ~ BB_per_game, data = .)
##
## Coefficients:
## (Intercept) BB_per_game
## 3.208 0.285
##
##
## [[5]]
##
## Call:
## lm(formula = R_per_game ~ BB_per_game, data = .)
##
## Coefficients:
## (Intercept) BB_per_game
## 3.070 0.365
##
##
## [[6]]
##
## Call:
## lm(formula = R_per_game ~ BB_per_game, data = .)
##
## Coefficients:
## (Intercept) BB_per_game
## 3.542 0.261
##
##
## [[7]]
##
## Call:
## lm(formula = R_per_game ~ BB_per_game, data = .)
##
## Coefficients:
## (Intercept) BB_per_game
## 2.881 0.512
##
##
## [[8]]
##
## Call:
## lm(formula = R_per_game ~ BB_per_game, data = .)
##
## Coefficients:
## (Intercept) BB_per_game
## 3.211 0.454
##
##
## [[9]]
##
## Call:
## lm(formula = R_per_game ~ BB_per_game, data = .)
##
## Coefficients:
## (Intercept) BB_per_game
## 3.40 0.44
#.$aa %>%$coefficients Why we can't assess the wanted elements with this way?
#data.frame(slope = .$aa$coefficients[2])
# summary(slope = cor(BB_per_game, R_per_game)*sd(R_per_game)/sd(BB_per_game))
dat %>%
group_by(HR_strata) %>%
do(data.frame(slope = lm(R_per_game ~ BB_per_game, data = .)$coefficients, # Now you see why adding [2] after `$coefficients` ?
se = summary(lm(R_per_game ~ BB_per_game, data = .))$coefficients[2,2]))
## # A tibble: 18 × 3
## # Groups: HR_strata [9]
## HR_strata slope se
## <dbl> <dbl> <dbl>
## 1 0.4 1.36 0.208
## 2 0.4 0.734 0.208
## 3 0.5 2.01 0.110
## 4 0.5 0.566 0.110
## 5 0.6 2.53 0.0974
## 6 0.6 0.412 0.0974
## 7 0.7 3.21 0.0705
## 8 0.7 0.285 0.0705
## 9 0.8 3.07 0.0653
## 10 0.8 0.365 0.0653
## 11 0.9 3.54 0.0751
## 12 0.9 0.261 0.0751
## 13 1 2.88 0.0751
## 14 1 0.512 0.0751
## 15 1.1 3.21 0.0855
## 16 1.1 0.454 0.0855
## 17 1.2 3.40 0.0801
## 18 1.2 0.440 0.0801
# =============================================================================================================================
# Need to properly extract the desired value from the lm or other object ======================================================
# Whe you use `$coefficients` to access lm() object, the first one [1] is the Intercept, and second one [2] is slope
# then how about the summarize(), think about how we can assess desired elements form it
====This is the data structure of summary() object, we can assess `$coefficients` then apply [2, 2] second row and second column
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.19837 0.11349 19.37 <2e-16 ***
BB_per_game 0.63788 0.03444 18.52 <2e-16 ***
====================================================================================================================================
In this video, we’ll describe the very useful do( ) function. The tidyverse functions know how to interpret group tibbles. Furthermore, to facilitate stringing commands through the pipe, tidyverse function consistently return data frames. Since this assures that the output of 1 is accepted as the input of another. But most our functions do not recognize group tibbles, nor do they return data frames. [][The lm( ) function is an example.]
The do( ) function serves as a bridge between our functions, such as lm( ) and the tidyverse. The do( ) function understands group tibbles and always returns a data frame. So let’s try to use the do( ) function to fit a regression line to each home run strata. We would do it like this. Notice that [][we did in fact fit a regression line to each strata] (when would we not ?). But the do( ) function would create a data frame with the first column being the strata value. And a column named fit. We chose that name. It could be any name. And that column will contain the result of the lm( ) call. Therefore, the return tibbles has a column with lm( ) objects in the cells, which is not very useful. Also note that if we do not name a column, then do( ) will return the actual output of lm( ), not a data frame.(understand how R doing its job) And this will result in an error since do( ) is expecting a data frame as output.
If you write this, you will get the following error. [][For a useful data frame to be constructed, the output of the function, inside do( ), must be a data frame as well]. (But we did stored lm object in one column before, using a variable) We could build a function that returns only what you want in the form of a data frame. (that is if you know the data structure of outcome object well) We could write for example, this function. And then we use to do( ) without naming the output, since we are already getting a data frame. We can write this very simple piece of code and now we get the expected result. We get the slope for each strata and the standard error for that slope. If we name the output, then we get a column containing the data frame. So if we write this piece of code, now once again, we get one of these complex tibbles with a column having a data frame in each cell. Which is again, not very useful.
All right. Now we’re going to cover one last feature of do( ). If the data frame being returned has more than one row, these will be concatenated appropriately. Here’s an example in which return both estimated parameters of each strata and their value and error. The slope and intercept. We write this piece of code. And now we use the do( ) function as we used it before, and get a very useful tibble, giving us the estimates of the slope and intercept, as well as the standard errors. Now, if you think this is all a bit too complicated, you’re not alone. To simplify things, we’re going to introduce the broom package, which was designed to facilitate the use of model fitting functions such as lm( ) with the tidyverse.
[][Textbook link]
This video corresponds to the textbook section discussing the do function. https://rafalab.github.io/dsbook/linear-models.html#linear-regression-in-the-tidyverse
[][Key points]
The do() function serves as a bridge between R functions, such as lm(), and the tidyverse.
We have to specify a column when using the do() function, otherwise we will get an error.
If the data frame being returned has more than one row, the rows will be concatenated appropriately.
Code
# use do to fit a regression line to each HR stratum
dat %>%
group_by(HR_strata) %>%
do(fit = lm(R_per_game ~ BB_per_game, data = .))
## # A tibble: 9 × 2
## # Rowwise:
## HR_strata fit
## <dbl> <list>
## 1 0.4 <lm>
## 2 0.5 <lm>
## 3 0.6 <lm>
## 4 0.7 <lm>
## 5 0.8 <lm>
## 6 0.9 <lm>
## 7 1 <lm>
## 8 1.1 <lm>
## 9 1.2 <lm>
not working r
# using do without a column name gives an error
dat %>%
group_by(HR_strata) %>%
do(lm(R_per_game ~ BB_per_game, data = .)) # SO if you dont pass lm object to a variable, the data frame will not be build
# define a function to extract slope from lm
get_slope <- function(data){
fit <- lm(R_per_game ~ BB_per_game, data = data)
data.frame(slope = fit$coefficients[2],
se = summary(fit)$coefficient[2,2])
}
# return the desired data frame
dat %>%
group_by(HR_strata) %>%
do(get_slope(.))
## # A tibble: 9 × 3
## # Groups: HR_strata [9]
## HR_strata slope se
## <dbl> <dbl> <dbl>
## 1 0.4 0.734 0.208
## 2 0.5 0.566 0.110
## 3 0.6 0.412 0.0974
## 4 0.7 0.285 0.0705
## 5 0.8 0.365 0.0653
## 6 0.9 0.261 0.0751
## 7 1 0.512 0.0751
## 8 1.1 0.454 0.0855
## 9 1.2 0.440 0.0801
# not the desired output: a column containing data frames
dat %>%
group_by(HR_strata) %>%
do(slope = get_slope(.))
## # A tibble: 9 × 2
## # Rowwise:
## HR_strata slope
## <dbl> <list>
## 1 0.4 <df [1 × 2]>
## 2 0.5 <df [1 × 2]>
## 3 0.6 <df [1 × 2]>
## 4 0.7 <df [1 × 2]>
## 5 0.8 <df [1 × 2]>
## 6 0.9 <df [1 × 2]>
## 7 1 <df [1 × 2]>
## 8 1.1 <df [1 × 2]>
## 9 1.2 <df [1 × 2]>
# data frames with multiple rows will be concatenated appropriately
get_lse <- function(data){
fit <- lm(R_per_game ~ BB_per_game, data = data)
data.frame(term = names(fit$coefficients),
estimate = fit$coefficients,
se = summary(fit)$coefficient[,2])
}
dat %>%
group_by(HR_strata) %>%
do(get_lse(.))
## # A tibble: 18 × 4
## # Groups: HR_strata [9]
## HR_strata term estimate se
## <dbl> <chr> <dbl> <dbl>
## 1 0.4 (Intercept) 1.36 0.631
## 2 0.4 BB_per_game 0.734 0.208
## 3 0.5 (Intercept) 2.01 0.344
## 4 0.5 BB_per_game 0.566 0.110
## 5 0.6 (Intercept) 2.53 0.305
## 6 0.6 BB_per_game 0.412 0.0974
## 7 0.7 (Intercept) 3.21 0.225
## 8 0.7 BB_per_game 0.285 0.0705
## 9 0.8 (Intercept) 3.07 0.213
## 10 0.8 BB_per_game 0.365 0.0653
## 11 0.9 (Intercept) 3.54 0.251
## 12 0.9 BB_per_game 0.261 0.0751
## 13 1 (Intercept) 2.88 0.256
## 14 1 BB_per_game 0.512 0.0751
## 15 1.1 (Intercept) 3.21 0.300
## 16 1.1 BB_per_game 0.454 0.0855
## 17 1.2 (Intercept) 3.40 0.291
## 18 1.2 BB_per_game 0.440 0.0801
dat %>%
group_by(HR_strata) %>%
do(fit = lm(R_per_game ~ BB_per_game, data = .))
## # A tibble: 9 × 2
## # Rowwise:
## HR_strata fit
## <dbl> <list>
## 1 0.4 <lm>
## 2 0.5 <lm>
## 3 0.6 <lm>
## 4 0.7 <lm>
## 5 0.8 <lm>
## 6 0.9 <lm>
## 7 1 <lm>
## 8 1.1 <lm>
## 9 1.2 <lm>
the do function would create a data frame with the first column being strata value and a column named fit.png
# Google R function() function, why we don't need to use comma between fit and data.frame() function
get_slope <- function(data){
fit <- lm(R_per_game ~ BB_per_game, data = data)
data.frame(slope = fit$coefficients[2],
se = summary(fit)$coefficient[2, 2])
}
# ================================================================================================================================= #
# What we just did is a Magic
dat %>%
group_by(HR_strata) %>%
do(get_slope(.))
## # A tibble: 9 × 3
## # Groups: HR_strata [9]
## HR_strata slope se
## <dbl> <dbl> <dbl>
## 1 0.4 0.734 0.208
## 2 0.5 0.566 0.110
## 3 0.6 0.412 0.0974
## 4 0.7 0.285 0.0705
## 5 0.8 0.365 0.0653
## 6 0.9 0.261 0.0751
## 7 1 0.512 0.0751
## 8 1.1 0.454 0.0855
## 9 1.2 0.440 0.0801
# Google R function() function, why we don't need to use comma between fit and data.frame() function
get_slope <- function(data){
fit <- lm(R_per_game ~ BB_per_game, data = data)
data.frame(items = names(fit$coefficients),
slope = fit$coefficients,
se = summary(fit)$coefficient[, 2]) # Think about the data structure in R, and how to access the desired values #
}
# ================================================================================================================================= #
# ================================================================================================================================= #
# What we just did is a Magic
dat %>%
group_by(HR_strata) %>%
do(get_slope(.))
## # A tibble: 18 × 4
## # Groups: HR_strata [9]
## HR_strata items slope se
## <dbl> <chr> <dbl> <dbl>
## 1 0.4 (Intercept) 1.36 0.631
## 2 0.4 BB_per_game 0.734 0.208
## 3 0.5 (Intercept) 2.01 0.344
## 4 0.5 BB_per_game 0.566 0.110
## 5 0.6 (Intercept) 2.53 0.305
## 6 0.6 BB_per_game 0.412 0.0974
## 7 0.7 (Intercept) 3.21 0.225
## 8 0.7 BB_per_game 0.285 0.0705
## 9 0.8 (Intercept) 3.07 0.213
## 10 0.8 BB_per_game 0.365 0.0653
## 11 0.9 (Intercept) 3.54 0.251
## 12 0.9 BB_per_game 0.261 0.0751
## 13 1 (Intercept) 2.88 0.256
## 14 1 BB_per_game 0.512 0.0751
## 15 1.1 (Intercept) 3.21 0.300
## 16 1.1 BB_per_game 0.454 0.0855
## 17 1.2 (Intercept) 3.40 0.291
## 18 1.2 BB_per_game 0.440 0.0801
The original task we ask for in a previous video was to provide an estimate and a confidence interval for the slope estimates of each strata ( Do you remember how this question was raised?? ). The broom package will make this quite easy. [][Broom has three main functions all of which extract information from the object returned by the function lm(), and return it in a tidy verse friendly data frame]. These functions are tidy, glance and augment. The tidy function returns estimates and related information as a data frame. Here’s an example. We can add other important summaries, such as confidence intervals, using arguments like this. Because the outcome is a data frame, we can immediately use it with do() function to string together the commands that produce the table we are after. So this piece of code will generate what we wanted to see.
Because a data frame is returned, we can filter and select the rows and columns we want. So this simple piece of code gives us exactly the table we asked for. We have filtered away the intercept rows, and only show the columns we care about, the estimate and the confidence intervals. Furthermore, [][a table like this makes visualization with ggplot quite easy]. So this piece of code produces this nice plot, which provides very useful information. Now we return to discussing our original task of determining if slopes change. The plot we just made using do and broom shows that the confidence intervals overlap, which provides a nice visual confirmation that our assumption that the slopes do not change with home run strata, is relatively safe. Earlier we mentioned two other functions from the broom package, glance and augment. Glance and augment relate to model specific and observation specific outcomes, respectively. Here we can see the model fit summary the glance returns. You can learn more about these summaries in any regression textbook. We’ll see an example of augment in a future video.
[][Textbook link] This video corresponds to the textbook section on the broom package. https://rafalab.github.io/dsbook/linear-models.html#the-broom-package
[][Key points] The broom package has three main functions, all of which extract information from the object returned by lm and return it in a tidyverse friendly data frame. The tidy() function returns estimates and related information as a data frame. The functions glance() and augment() relate to model specific and observation specific outcomes respectively.
[][Code]
# use tidy to return lm estimates and related information as a data frame
library(broom)
fit <- lm(R ~ BB, data = dat)
tidy(fit)
# add confidence intervals with tidy
tidy(fit, conf.int = TRUE)
# pipeline with lm, do, tidy
dat %>%
group_by(HR) %>%
do(tidy(lm(R ~ BB, data = .), conf.int = TRUE)) %>%
filter(term == "BB") %>%
select(HR, estimate, conf.low, conf.high)
# make ggplots
dat %>%
group_by(HR) %>%
do(tidy(lm(R ~ BB, data = .), conf.int = TRUE)) %>%
filter(term == "BB") %>%
select(HR, estimate, conf.low, conf.high) %>%
ggplot(aes(HR, y = estimate, ymin = conf.low, ymax = conf.high)) +
geom_errorbar() +
geom_point()
# inspect with glance
glance(fit)
library(broom) # tidy() function
library(tidyverse)
library(Lahman)
Teams_small <- Teams %>%
filter(yearID %in% 1961:2001)
small_dat <- Teams_small %>%
mutate(R_per_game = R/G, BB_per_game = BB/G, HR_per_game = round(HR/G, digits = 1)) %>%
select(HR_per_game, R_per_game, BB_per_game) %>%
filter(HR_per_game > 0.5 & HR_per_game < 1.2)
s_dat <- small_dat %>%
group_by(HR_per_game)
fit <- lm(R_per_game ~ BB_per_game, data = s_dat)
tidy(fit)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.57 0.119 21.5 2.80e-81
## 2 BB_per_game 0.531 0.0362 14.7 2.81e-43
tidy(fit, conf.int = TRUE) # Adding cofidence intervals
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.57 0.119 21.5 2.80e-81 2.33 2.80
## 2 BB_per_game 0.531 0.0362 14.7 2.81e-43 0.460 0.602
#do(tidy(lm(R_per_game ~ BB_per_game, data = .))) %>%
small_dat %>%
group_by(HR_per_game) %>%
do(tidy(lm(R_per_game ~ BB_per_game, data = .), conf.int = TRUE)) # Now can you recall how can we access the slope only ???
## # A tibble: 12 × 8
## # Groups: HR_per_game [6]
## HR_per_game term estimate std.error statistic p.value conf.low conf.high
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.6 (Interc… 2.53 0.305 8.32 2.43e-13 1.93 3.14
## 2 0.6 BB_per_… 0.412 0.0974 4.23 4.80e- 5 0.219 0.605
## 3 0.7 (Interc… 3.21 0.225 14.3 1.49e-30 2.76 3.65
## 4 0.7 BB_per_… 0.285 0.0705 4.05 7.93e- 5 0.146 0.425
## 5 0.8 (Interc… 3.07 0.213 14.4 5.40e-31 2.65 3.49
## 6 0.8 BB_per_… 0.365 0.0653 5.59 9.13e- 8 0.236 0.494
## 7 0.9 (Interc… 3.54 0.251 14.1 8.77e-29 3.05 4.04
## 8 0.9 BB_per_… 0.261 0.0751 3.47 6.85e- 4 0.112 0.409
## 9 1 (Interc… 2.88 0.256 11.3 6.62e-21 2.37 3.39
## 10 1 BB_per_… 0.512 0.0751 6.81 3.28e-10 0.363 0.660
## 11 1.1 (Interc… 3.21 0.300 10.7 6.46e-17 2.61 3.81
## 12 1.1 BB_per_… 0.454 0.0855 5.31 1.03e- 6 0.284 0.624
small_dat %>%
group_by(HR_per_game) %>%
do(tidy(lm(R_per_game ~ BB_per_game, data = .), conf.int = TRUE)) %>%
filter(term == "BB_per_game")
## # A tibble: 6 × 8
## # Groups: HR_per_game [6]
## HR_per_game term estimate std.error statistic p.value conf.low conf.high
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.6 BB_per_g… 0.412 0.0974 4.23 4.80e- 5 0.219 0.605
## 2 0.7 BB_per_g… 0.285 0.0705 4.05 7.93e- 5 0.146 0.425
## 3 0.8 BB_per_g… 0.365 0.0653 5.59 9.13e- 8 0.236 0.494
## 4 0.9 BB_per_g… 0.261 0.0751 3.47 6.85e- 4 0.112 0.409
## 5 1 BB_per_g… 0.512 0.0751 6.81 3.28e-10 0.363 0.660
## 6 1.1 BB_per_g… 0.454 0.0855 5.31 1.03e- 6 0.284 0.624
small_dat %>%
group_by(HR_per_game) %>%
do(tidy(lm(R_per_game ~ BB_per_game, data = .), conf.int = TRUE)) %>%
filter(term == "BB_per_game") %>%
ggplot(aes(HR_per_game, y = estimate, ymin = conf.low, ymax = conf.high)) +
geom_errorbar() +
geom_point()
# |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
confidence intervals overlaps
library(tidyverse)
library(broom) # tidy() function
library(Lahman)
Teams_small <- Teams %>%
filter(yearID %in% 1961:2001)
small_dat <- Teams_small %>%
mutate(R_per_game = R/G, BB_per_game = BB/G, HR_per_game = round(HR/G, digits = 1)) %>%
select(HR_per_game, R_per_game, BB_per_game) %>%
filter(HR_per_game > 0.5 & HR_per_game < 1.2)
small_dat %>%
group_by(HR_per_game) %>%
#do(tidy(lm(R_per_game ~ BB_per_game, data = .))) ### Can we apply filter() after this? Say we want Intercept only
do(tidy(lm(R_per_game ~ BB_per_game, data = .))) %>%
filter(., term != "BB_per_game")
## # A tibble: 6 × 6
## # Groups: HR_per_game [6]
## HR_per_game term estimate std.error statistic p.value
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 0.6 (Intercept) 2.53 0.305 8.32 2.43e-13
## 2 0.7 (Intercept) 3.21 0.225 14.3 1.49e-30
## 3 0.8 (Intercept) 3.07 0.213 14.4 5.40e-31
## 4 0.9 (Intercept) 3.54 0.251 14.1 8.77e-29
## 5 1 (Intercept) 2.88 0.256 11.3 6.62e-21
## 6 1.1 (Intercept) 3.21 0.300 10.7 6.46e-17
# \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
small_dat %>%
group_by(HR_per_game) %>%
do(data.frame(name = names(lm(R_per_game ~ BB_per_game, data = .)$coefficient[1]),
value = tidy(lm(R_per_game ~ BB_per_game, data = .))$estimate[1],
se = tidy(lm(R_per_game ~ BB_per_game, data = .))$std.error[1]))
## # A tibble: 6 × 4
## # Groups: HR_per_game [6]
## HR_per_game name value se
## <dbl> <chr> <dbl> <dbl>
## 1 0.6 (Intercept) 2.53 0.305
## 2 0.7 (Intercept) 3.21 0.225
## 3 0.8 (Intercept) 3.07 0.213
## 4 0.9 (Intercept) 3.54 0.251
## 5 1 (Intercept) 2.88 0.256
## 6 1.1 (Intercept) 3.21 0.300
# //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
# Should we BoxPlot it? And see the
# stratify HR per game to nearest 10, filter out strata with few points
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR_strata = round(HR/G, 1),
BB_per_game = BB / G,
R_per_game = R / G) %>%
select(HR_strata, BB_per_game, R_per_game) %>% # Why doing select() here?
filter(HR_strata >= 0.4 & HR_strata <=1.2)
summary(lm(R_per_game ~ BB_per_game, data = dat))
##
## Call:
## lm(formula = R_per_game ~ BB_per_game, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2883 -0.3195 0.0113 0.3388 1.4105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1984 0.1135 19.4 <2e-16 ***
## BB_per_game 0.6379 0.0344 18.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.454 on 947 degrees of freedom
## Multiple R-squared: 0.266, Adjusted R-squared: 0.265
## F-statistic: 343 on 1 and 947 DF, p-value: <2e-16
dat %>%
group_by(HR_strata) %>%
#do(data.frame(aa = lm(R_per_game ~ BB_per_game, data = .))) # Think, Think, Think, Think, Think, Think
do(aa = lm(R_per_game ~ BB_per_game, data = .))
## # A tibble: 9 × 2
## # Rowwise:
## HR_strata aa
## <dbl> <list>
## 1 0.4 <lm>
## 2 0.5 <lm>
## 3 0.6 <lm>
## 4 0.7 <lm>
## 5 0.8 <lm>
## 6 0.9 <lm>
## 7 1 <lm>
## 8 1.1 <lm>
## 9 1.2 <lm>
Comprehension Check due Jun 12, 2022 00:29 AWST # Question 1 1/1 point (graded)
As seen in the videos, what problem do we encounter when we try to run a linear model on our baseball data, grouping by home runs?
There is not enough data in some levels to run the model. The lm() function does not know how to handle grouped tibbles. [][The key is to read, to think, thus proving answer here is bad] The results of the lm() function cannot be put into a tidy format. correct Explanation
The lm() function does not know how to handle grouped tibbles, so we can’t simply run a linear model on the baseball data grouped by home runs. We need something to bridge between the grouped tibble and the lm() function.
Show answer Submit You have used 1 of 1 attemptSome
1/1 point (graded) Tibbles are similar to what other class in R?
Vectors Matrices Data frames Lists correct Explanation
Tibbles are essentially modern versions of data frames.
1/1 point (graded) What are some advantages of tibbles compared to data frames? Select ALL that apply.
Tibbles display better. If you subset a tibble, you always get back a tibble. Tibbles can have complex entries. Tibbles can be grouped. correct Answer Correct: Correct, this is one advantage of a tibble. There are several other correct answers listed here. Correct, this is one advantage of a tibble. There are several other correct answers listed here. Correct, this is one advantage of a tibble. There are several other correct answers listed here. Correct, this is one advantage of a tibble. There are several other correct answers listed here. Explanation
All of the listed answers are advantages of tibbles when compared to data frames: tibbles display better, they always return tibbles when subsetted, they can have complex entries, and they can be grouped.
SaveSave your answer Show answer Submit You have used 1 of 2 attempts
1/1 point (graded) What are two advantages of the do() command, when applied to the tidyverse? Select TWO.
It is faster than normal functions. It returns useful error messages. It understands grouped tibbles. It always returns a data.frame. correct Answer Correct: Correct. The do function can understand grouped tibbles. Correct. The do function always returns a data.frame. Explanation
The do function serves as a useful bridge between base R functions and the tidyverse. It understands grouped tibbles and always returns a data.frame.
SaveSave your answer Show answer Submit You have used 1 of 2 attempts
1/1 point (graded) You want to take the tibble dat, which we used in the video on the do() function, and run the linear model R ~ BB for each strata of HR. Then you want to add three new columns to your grouped tibble: the coefficient, standard error, and p-value for the BB term in the model.
You’ve already written the function get_slope(), shown below.
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR = round(HR/G, 1),
BB = BB / G,
R = R / G) %>%
select(HR, BB, R) %>% # Why doing select() here?
filter(HR >= 0.4 & HR <=1.2)
get_slope <- function(data) {
fit <- lm(R ~ BB, data = data)
sum.fit <- summary(fit)
data.frame(slope = sum.fit$coefficients[2, "Estimate"],
se = sum.fit$coefficients[2, "Std. Error"],
pvalue = sum.fit$coefficients[2, "Pr(>|t|)"])
}
What additional code could you write to accomplish your goal?
dat %>%
group_by(HR) %>%
do(get_slope)
dat %>%
group_by(HR) %>%
do(get_slope(.))
dat %>%
group_by(HR) %>%
do(slope = get_slope(.))
dat %>%
do(get_slope(.))
correct Answer Correct:Correct. This will create a tibble with four columns: HR, slope, se, and pvalue for each level of HR. Explanation
dat %>%
group_by(HR) %>%
do(get_slope(.))
## # A tibble: 9 × 4
## # Groups: HR [9]
## HR slope se pvalue
## <dbl> <dbl> <dbl> <dbl>
## 1 0.4 0.734 0.208 1.54e- 3
## 2 0.5 0.566 0.110 3.02e- 6
## 3 0.6 0.412 0.0974 4.80e- 5
## 4 0.7 0.285 0.0705 7.93e- 5
## 5 0.8 0.365 0.0653 9.13e- 8
## 6 0.9 0.261 0.0751 6.85e- 4
## 7 1 0.512 0.0751 3.28e-10
## 8 1.1 0.454 0.0855 1.03e- 6
## 9 1.2 0.440 0.0801 1.07e- 6
This is the only command that correctly creates a tibble with four columns: HR, slope, se, and pvalue for each level of HR. The data frame must be passed to get_slope() using .. If you name the results of the do() command such as in the code do(slope = get_slope(.)), that will save all results in a single column called slope. If you forget group_by(), then the results will be a model on the data as a whole, rather than on the data stratified by home runs.
SaveSave your answer Show answer Submit You have used 1 of 2 attempts
1/1 point (graded) The output of a broom function is always what?
A data.frame A list A vector correct Explanation
The broom functions always output data.frame.
1/1 point (graded) You want to know whether the relationship between home runs and runs per game varies by baseball league. You create the following dataset:
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR = HR/G,
R = R/G) %>%
select(lgID, HR, BB, R)
What code would help you quickly answer this question?
dat %>%
group_by(lgID) %>%
do(tidy(lm(R ~ HR, data = .), conf.int = T)) %>%
filter(term == "HR")
## # A tibble: 2 × 8
## # Groups: lgID [2]
## lgID term estimate std.error statistic p.value conf.low conf.high
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AL HR 1.90 0.0734 25.9 1.29e-95 1.75 2.04
## 2 NL HR 1.76 0.0671 26.2 1.16e-95 1.62 1.89
dat %>%
group_by(lgID) %>%
do(glance(lm(R ~ HR, data = .)))
## # A tibble: 2 × 13
## # Groups: lgID [2]
## lgID r.squared adj.r.squared sigma statistic p.value df logLik AIC
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AL 0.561 0.560 0.411 668. 1.29e-95 1 -278. 562.
## 2 NL 0.579 0.578 0.347 685. 1.16e-95 1 -180. 365.
## # … with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## # nobs <int>
dat %>%
do(tidy(lm(R ~ HR, data = .), conf.int = T)) %>%
filter(term == "HR")
## # A tibble: 1 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 HR 1.84 0.0491 37.6 4.29e-195 1.75 1.94
dat %>%
group_by(lgID) %>%
do(mod = lm(R ~ HR, data = .))
## # A tibble: 2 × 2
## # Rowwise:
## lgID mod
## <fct> <list>
## 1 AL <lm>
## 2 NL <lm>
correct Answer Correct:Correct. This is a good application of the command tidy(), from the broom package. Explanation
dat %>%
group_by(lgID) %>%
do(tidy(lm(R ~ HR, data = .), conf.int = T)) %>%
filter(term == "HR")
This code is a good application of the command tidy(), from the broom package.
The glance() function provides data on model fit rather than on effect estimates and confidence intervals. If you forget the line group_by(lgID), your code will give you a single estimate for the entire dataset because you have not grouped the data by league ID.
dat %>%
group_by(lgID) %>%
do(mod = lm(R ~ HR, data = .))
This code gives get a data.frame with the column mod, which contains the linear model results. While it is possible to then extract effect estimates and confidence intervals from this model, it is not nearly as easy as using the tidy function.
SaveSave your answer Show answer Submit You have used 1 of 2 attempts
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR = round(HR/G, 1),
BB = BB / G,
R = R / G) %>%
select(HR, BB, R) %>% # Why doing select() here?
filter(HR >= 0.4 & HR <=1.2)
get_slope <- function(data) {
fit <- lm(R ~ BB, data = data)
sum.fit <- summary(fit)
data.frame(slope = sum.fit$coefficients[2, "Estimate"],
se = sum.fit$coefficients[2, "Std. Error"],
pvalue = sum.fit$coefficients[2, "Pr(>|t|)"])
}
dat %>%
group_by(HR) %>%
do(get_slope(.))
## # A tibble: 9 × 4
## # Groups: HR [9]
## HR slope se pvalue
## <dbl> <dbl> <dbl> <dbl>
## 1 0.4 0.734 0.208 1.54e- 3
## 2 0.5 0.566 0.110 3.02e- 6
## 3 0.6 0.412 0.0974 4.80e- 5
## 4 0.7 0.285 0.0705 7.93e- 5
## 5 0.8 0.365 0.0653 9.13e- 8
## 6 0.9 0.261 0.0751 6.85e- 4
## 7 1 0.512 0.0751 3.28e-10
## 8 1.1 0.454 0.0855 1.03e- 6
## 9 1.2 0.440 0.0801 1.07e- 6
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR = HR/G,
R = R/G) %>%
select(lgID, HR, BB, R)
dat %>%
group_by(lgID) %>%
do(tidy(lm(R ~ HR, data = .), conf.int = T)) %>%
filter(term == "HR")
## # A tibble: 2 × 8
## # Groups: lgID [2]
## lgID term estimate std.error statistic p.value conf.low conf.high
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AL HR 1.90 0.0734 25.9 1.29e-95 1.75 2.04
## 2 NL HR 1.76 0.0671 26.2 1.16e-95 1.62 1.89
dat %>%
group_by(lgID) %>%
do(glance(lm(R ~ HR, data = .)))
## # A tibble: 2 × 13
## # Groups: lgID [2]
## lgID r.squared adj.r.squared sigma statistic p.value df logLik AIC
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AL 0.561 0.560 0.411 668. 1.29e-95 1 -278. 562.
## 2 NL 0.579 0.578 0.347 685. 1.16e-95 1 -180. 365.
## # … with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## # nobs <int>
dat %>%
do(tidy(lm(R ~ HR, data = .), conf.int = T)) %>%
filter(term == "HR")
## # A tibble: 1 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 HR 1.84 0.0491 37.6 4.29e-195 1.75 1.94
dat %>%
group_by(lgID) %>%
do(mod = lm(R ~ HR, data = .))
## # A tibble: 2 × 2
## # Rowwise:
## lgID mod
## <fct> <list>
## 1 AL <lm>
## 2 NL <lm>
Comprehension Check due Jun 12, 2022 00:29 AWST We have investigated the relationship between fathers’ heights and sons’ heights. But what about other parent-child relationships? Does one parent’s height have a stronger association with child height? How does the child’s gender affect this relationship in heights? Are any differences that we observe statistically significant?
The galton dataset is a sample of one male and one female child from each family in the GaltonFamilies dataset. The pair column denotes whether the pair is father and daughter, father and son, mother and daughter, or mother and son.
Create the galton dataset using the code below:
library(tidyverse)
library(HistData)
library(broom)
data("GaltonFamilies")
#set.seed(1) # if you are using R 3.5 or earlier
set.seed(1, sample.kind = "Rounding") # if you are using R 3.6 or later
## Warning in set.seed(1, sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
head(GaltonFamilies) # First thing is to look at your data, understand the structure of it
## family father mother midparentHeight children childNum gender childHeight
## 1 001 78.5 67.0 75.4 4 1 male 73.2
## 2 001 78.5 67.0 75.4 4 2 female 69.2
## 3 001 78.5 67.0 75.4 4 3 female 69.0
## 4 001 78.5 67.0 75.4 4 4 female 69.0
## 5 002 75.5 66.5 73.7 4 1 male 73.5
## 6 002 75.5 66.5 73.7 4 2 male 72.5
galton <- GaltonFamilies %>%
group_by(family, gender) %>%
sample_n(1) %>%
ungroup() %>%
gather(parent, parentHeight, father:mother) %>%
mutate(child = ifelse(gender == "female", "daughter", "son")) %>%
unite(pair, c("parent", "child"))
galton
## # A tibble: 710 × 8
## family midparentHeight children childNum gender childHeight pair
## <fct> <dbl> <int> <int> <fct> <dbl> <chr>
## 1 001 75.4 4 2 female 69.2 father_daughter
## 2 001 75.4 4 1 male 73.2 father_son
## 3 002 73.7 4 4 female 65.5 father_daughter
## 4 002 73.7 4 2 male 72.5 father_son
## 5 003 72.1 2 2 female 68 father_daughter
## 6 003 72.1 2 1 male 71 father_son
## 7 004 72.1 5 5 female 63 father_daughter
## 8 004 72.1 5 2 male 68.5 father_son
## 9 005 69.1 6 5 female 62.5 father_daughter
## 10 005 69.1 6 1 male 72 father_son
## # … with 700 more rows, and 1 more variable: parentHeight <dbl>
2.0/2.0 points (graded) Group by pair and summarize the number of observations in each group.
How many father-daughter pairs are in the dataset? 176 correct
How many mother-son pairs are in the dataset? 179 correct
SaveSave your answer Show answer Submit You have used 2 of 10 attempts
2.0/2.0 points (graded) Calculate the correlation coefficients for fathers and daughters, fathers and sons, mothers and daughters and mothers and sons.
Which pair has the strongest correlation in heights?
fathers and daughters fathers and sons mothers and daughters mothers and sons correct
Which pair has the weakest correlation in heights?
fathers and daughters fathers and sons mothers and daughters mothers and sons correct Show answer Submit You have used 2 of 2 attempts
Question 10 has two parts. The information here applies to both parts.
Use lm() and the broom package to fit regression lines for each parent-child pair type. Compute the least squares estimates, standard errors, confidence intervals and p-values for the parentHeight coefficient for each pair.
2/2 points (graded) What is the estimate of the father-daughter coefficient? 0.345 correct 0.345
Explanation
The estimate can be calculated using the following code:
library(broom)
galton %>%
group_by(pair) %>%
do(tidy(lm(childHeight ~ parentHeight, data = .), conf.int = TRUE)) %>%
filter(term == "parentHeight", pair == "father_daughter") %>%
pull(estimate)
For every 1-inch increase in mother’s height, how many inches does the typical son’s height increase? Give your answer as a number with no units.
0.381 correct 0.381
Explanation
The following code will give the decrease in height:
galton %>%
group_by(pair) %>%
do(tidy(lm(childHeight ~ parentHeight, data = .), conf.int = TRUE)) %>%
filter(term == "parentHeight", pair == "mother_son") %>%
pull(estimate)
SaveSave your answer Show answer Submit You have used 1 of 10 attempts
2/2 points (graded) Which sets of parent-child heights are significantly correlated at a p-value cut off of .05? Select ALL that apply.
father-daughter father-son mother-daughter mother-son correct
When considering the estimates, which of the following statements are true? Select ALL that apply.
All of the confidence intervals overlap each other. At least one confidence interval covers zero. The confidence intervals involving mothers’ heights are larger than the confidence intervals involving fathers’ heights. The confidence intervals involving daughters’ heights are larger than the confidence intervals involving sons’ heights. The data are consistent with inheritance of height being independent of the child’s gender. The data are consistent with inheritance of height being independent of the parent’s gender. correct Answer Correct: Correct. The confidence intervals all overlap. Correct. The std.error values are higher for mothers than fathers, resulting in larger confidence intervals. Correct. The confidence intervals overlap. Correct. The confidence intervals overlap. SaveSave your answer Show answer Submit You have used 1 of 3 attemptsSome problems have options such as save, reset, hints, or show answer. These options follow the Submit button.Correct (2/2 points)
galton %>%
#group_by(pair) %>%
filter(.$pair == "father_daughter") %>%
summarize(.$parentHeight)
## # A tibble: 176 × 1
## `.$parentHeight`
## <dbl>
## 1 78.5
## 2 75.5
## 3 75
## 4 75
## 5 75
## 6 74
## 7 74
## 8 74
## 9 74.5
## 10 74
## # … with 166 more rows
galton %>%
#group_by(pair) %>%
filter(.$pair == "mother_son") %>%
summarize(.$parentHeight)
## # A tibble: 179 × 1
## `.$parentHeight`
## <dbl>
## 1 67
## 2 66.5
## 3 64
## 4 64
## 5 58.5
## 6 68
## 7 62
## 8 67
## 9 67
## 10 66.5
## # … with 169 more rows
galton %>%
group_by(pair) %>%
do(tidy(lm(childHeight ~ parentHeight, data =.)))
## # A tibble: 8 × 6
## # Groups: pair [4]
## pair term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 father_daughter (Intercept) 40.1 4.16 9.65 6.50e-18
## 2 father_daughter parentHeight 0.345 0.0599 5.77 3.56e- 8
## 3 father_son (Intercept) 38.6 4.84 7.98 1.81e-13
## 4 father_son parentHeight 0.443 0.0700 6.33 1.94e- 9
## 5 mother_daughter (Intercept) 38.9 4.62 8.41 1.46e-14
## 6 mother_daughter parentHeight 0.394 0.0720 5.47 1.56e- 7
## 7 mother_son (Intercept) 44.9 5.02 8.94 4.96e-16
## 8 mother_son parentHeight 0.381 0.0784 4.86 2.59e- 6
# FD,345, FS,443, MD,394, MS, 381 # Why correlation changed between lm() function and just cor()
galton %>%
filter(pair == "father_daughter") %>%
summarise(cor(childHeight, parentHeight))
## # A tibble: 1 × 1
## `cor(childHeight, parentHeight)`
## <dbl>
## 1 0.401
# FD 401
galton %>%
filter(pair == "father_son") %>%
summarise(cor(childHeight, parentHeight))
## # A tibble: 1 × 1
## `cor(childHeight, parentHeight)`
## <dbl>
## 1 0.430
# FS 430
galton %>%
filter(pair == "mother_daughter") %>%
summarise(cor(childHeight, parentHeight))
## # A tibble: 1 × 1
## `cor(childHeight, parentHeight)`
## <dbl>
## 1 0.383
# MD 383
galton %>%
filter(pair == "mother_son") %>%
summarise(cor(childHeight, parentHeight))
## # A tibble: 1 × 1
## `cor(childHeight, parentHeight)`
## <dbl>
## 1 0.343
# MS 343
galton %>%
group_by(pair) %>%
#cor(childHeight, parentHeight) %>% # This line not working
summarize(cor = cor(parentHeight, childHeight))
## # A tibble: 4 × 2
## pair cor
## <chr> <dbl>
## 1 father_daughter 0.401
## 2 father_son 0.430
## 3 mother_daughter 0.383
## 4 mother_son 0.343
# Q9 lm() cor() outcome difference, please help ====================================================================================
question posted 40 minutes ago by john_hhu2020
I naturally thought using lm() function to solve it would be a good idea, but it seems not.
Comparison between lm() and cor() - the results are not corresponding. Is that because in lm() we used LSE and in cor() it calculate correlation based on the definition formula? Please enlight me
https://sites.utexas.edu/sos/guided/inferential/numeric/bivariate/cor/
Is that because of the Correlation measure if two variable are y = x relationship? As you mentioned the variable_a's correlation to itself is 1 (thus the slope should be 1? so the value of Correlation represent the difference to this dream line?). And the Slope in lm() function represent the influence variable can apply onto another? Please enlight me about this topic. Please please
And I Googled some statement saying this but I really don't think I understand it enough:
"Correlation look at trends shared between two variables, and regression look at relation between a predictor (independent variable) and a response (dependent) variable."
rho = 1/n * SUM(i_1-n) (x_i - mu_x)/Sigma_x * (y_i -mu_y)/Sigma_y
nickbirk (Staff)
about 9 hours ago
Vote for this post, there are currently 0 votes
Mark as Answer
The output of lm() does not display the correlation. Recall that the formula of the regression slope for an equation with just one predictor is:
slope = rho*(sd_y/sd_x)
So, the correlation is used to calculate the slope, but is not itself the slope. Indeed, it is possible that one equation may have a larger slope, but a smaller correlation, based on the value of the standard deviation of y. That is, if x has a very large standard deviation as compared to y, the slope will be quite small even with a high correlation. For example, if we had two variables with a correlation of 0.9, but with a standard deviation of y of 1 and a standard deviation of x of 100, our slope would be = 0.9*(1/100) = 0.009. The slope is small, but the correlation is still high.
One way to think about the difference between slope and correlation could be the relationship between height in centimeters and height in inches. Since one is directly computed from the other, the correlation will be exactly 1. However, if we wish to predict height in inches given height in centimeters, the formula would be (height in inches) = 0.3937*(height in centimeters). That is, the slope is less than 1, but the correlation is still 1.
Very thank you for the explanation, I think the center of your statement is based on the equation: slope = rho*(sd_y/sd_x). Whereas I don't know where does that comes from, or how can we prove it. Need help here
How about standalize the variables (mu_var = 0 and SD_var = 1), then the slope would be equal to correlation, I think we have learned that before. With the same SD and mu of var pairs, the equation becomes: rho = (1/n) SUM(i:1-n) x_i * y_i, ~ And ???
I just re-read the Section 1 Introduction to Regression / 1.2 Correlation / Correlation Coefficient part.
--The Correlation is the Avg of dot product SUM of var_i and mu_var distance divide by SD(var) of each var (x_i, y_i) pair, or rho = (1/n) SUM(i: 1-n) (x_i - mu_x)/SD_x * (y_i - mu_y)/SD_y.
--And if we think it as the SD away form mu_var which was mentioned in course statement, we can think it as the distance away from the center point of scatters, it can be further think as a measurement of un-roundness of the scatters.
--But I don't know where does the equation comes from: slope = rho*(sd_y/sd_x), and I don't think we have talked about this topic in the videos or coursework, where does it comes from and others
posted less than a minute ago by john_hhu2020
galton %>%
group_by(pair) %>%
#do(tidy(lm(childHeight ~ parentHeight, data = .)))
do(tidy(lm(childHeight ~ parentHeight, data = .), conf.int = TRUE))
## # A tibble: 8 × 8
## # Groups: pair [4]
## pair term estimate std.error statistic p.value conf.low conf.high
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 father_daughter (Int… 40.1 4.16 9.65 6.50e-18 31.9 48.3
## 2 father_daughter pare… 0.345 0.0599 5.77 3.56e- 8 0.227 0.464
## 3 father_son (Int… 38.6 4.84 7.98 1.81e-13 29.1 48.2
## 4 father_son pare… 0.443 0.0700 6.33 1.94e- 9 0.305 0.581
## 5 mother_daughter (Int… 38.9 4.62 8.41 1.46e-14 29.7 48.0
## 6 mother_daughter pare… 0.394 0.0720 5.47 1.56e- 7 0.252 0.536
## 7 mother_son (Int… 44.9 5.02 8.94 4.96e-16 35.0 54.8
## 8 mother_son pare… 0.381 0.0784 4.86 2.59e- 6 0.226 0.535
galton %>%
group_by(pair) %>%
do(tidy(lm(childHeight ~ parentHeight, data = .), conf.int = TRUE)) %>%
#filter(term == "parentHeight", pair == "mother_son") %>%
pull(estimate)
## [1] 40.134 0.345 38.612 0.443 38.858 0.394 44.878 0.381
galton %>%
group_by(pair) %>%
do(tidy(lm(childHeight ~ parentHeight, data = .), conf.int = TRUE)) %>%
filter(term == "parentHeight") %>%
ggplot(aes(pair, y = estimate, ymin = conf.low, ymax = conf.high)) +
geom_errorbar() +
geom_point()
library(tidyverse)
library(broom)
head(dat)
## lgID HR BB R
## 1 AL 0.914 581 4.24
## 2 AL 0.687 647 4.47
## 3 AL 0.847 550 4.69
## 4 NL 1.128 539 4.42
## 5 NL 1.026 423 4.61
## 6 AL 0.932 492 4.58
cor(dat$BB, dat$R)
## [1] 0.456
cor(dat$BB, dat$R)**2
## [1] 0.208
summary(lm(BB ~ R, data = dat))
##
## Call:
## lm(formula = BB ~ R, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -241.31 -42.68 2.63 46.00 188.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 255.74 16.41 15.6 <2e-16 ***
## R 61.17 3.73 16.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70.4 on 1024 degrees of freedom
## Multiple R-squared: 0.208, Adjusted R-squared: 0.207
## F-statistic: 268 on 1 and 1024 DF, p-value: <2e-16
tidy(lm(BB ~ R, data = dat))
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 256. 16.4 15.6 2.61e-49
## 2 R 61.2 3.73 16.4 9.16e-54
In trying to answer how well bases on balls predict runs, data exploration led us to this model. Here, the data is approximately normal. And conditional distributions were also normal. Thus, we’re justified to pose a linear model like this. With Yi, the runs per game. X1, BB per game. And x2, home runs per game. To use lm here, we need to let it know that we have two predictor variables. So we use the plus (+) symbol as follows. Here’s the code that fits that multiple regression model. Now, we can use the tidy() function to see the nice summary. When we fit the model with only one variable without the adjustment, the estimated slopes were 0.735 and 1.844 for bases on ball and home runs, respectively. But note that when we fit the multivariate model, both these slopes go down with the bases on balls effect decreasing much more.
Now, if we want to construct a metric to pick players, we need to consider singles, doubles, and triples as well. Can we build a model that predicts runs based on all these outcomes? Now, we’re going to take somewhat of a leap of faith and assume that these five variables are jointly normal. This means that if we pick any one of them and hold the other four fixed, the relationship with the outcome–in this case, runs per game–is linear. And the slopes for this relationship do not depend on the other four values that were held constant. If this is true, if this model holds true, then a linear model for our data is the following. With x1, x2, x3, x4, x5 representing bases on balls per game, singles per game, doubles per game, triples per game, and home runs per game, respectively. Using lm, we can quickly find the least squared errors for the parameters using this relatively simple piece of code.
We can again use the tidy() function to see the [][coefficients], the standard errors, and confidence intervals. To see how well our metric actually predicts runs, we can predict the number of runs for each team in 2002 using the function predict() to make the plot. Note that we did not use the 2002 year to create this metric. We used data from years previous to 2002. And here is the plot. Our model does quite a good job, as demonstrated by the fact that points from the observed versus predicted plot fall close to the identity line. So instead of using batting average or just the number of home runs as a measure for picking players, we can use our fitted model to form a more informative metric that relates more directly to run production. [][*Specifically, to define a metric for player A, we imagine a team made up of players just like player A and use our fitted regression model to predict how many runs this team would produce8]. The formula would look like this. We’re basically sticking in the estimated coefficients (now I dont fully understand the coefficients and many related terms from summary() function or we learned in above courses, need to go back restudy it) into the regression formula. However, to define a player-specific metric, we have a bit more work to do.
Our challenge here is that we have derived the metrics for teams based on team-level summary statistics. For example, the home run value that is entered into the equation is home runs per game for the entire team. If you compute the home runs per game for a player, it will be much lower. As the total is accumulated by nine batters, not just one. Furthermore, if a player only plays part of the game and gets less opportunity than average, it’s still considered a game played. So this means that their rates will be lower than they should be. [][For players, a rate that takes into account opportunities is a per-plate-appearance rate]. To make the per-game team rate comparable to the per-plate-appearance player rate, we compute the average number of team plate appearances per game using this simple piece of code.
Now, we’re ready to use our metric. We’re going to compute the per-plate-appearance rates for players available in 2002 (you mean 2001 and before? as its not 2002 yet). But we’re going to use data from 1999-2001. Because remember, we are picking players in 2002. We don’t know what has happened yet. To [][avoid small sample artifacts], we’re going to filter players with few plate appearances. Here is the calculation of what we want to do in one long line of code using tidyverse.
So we fit our model. And we have player-specific metrics. The player-specific predicted runs computer here can be interpreted as the number of runs we would predict a team to score if this team was made up of just that player, if that player batted every single time. The distribution shows that there’s wide variability across players, as we can see here. To actually build the teams, we will need to know the players’ salaries, since we have a limited budget. Remember, we are pretending to be the Oakland A’s in 2002 with only a $40 million budget. We also need to know the players’ position. Because we’re going to need one shortstop, one second baseman, one third baseman, et cetera.
For this, we’re going to have to do a little bit of data wrangling to
combine information that is contained in different tables from the
Lehman library. OK, so here we go. We start by adding the 2002 salaries
for each player using this code
([][If we are really doing it, the only info we have would be salary in 2001 and per plat appearance in 2001 too]).
Next, we’re going to add the defensive position. This is a little bit
complicated, because players play more than one position each year. So
here, we’re going to pick the one position most played by each
player using the top_n function??????? [][******I assume this part
is under updating, which means we can rewrite new code here and really
doing it******]. And to make sure that we only pick one position in the
case of ties, we’re going to take the first row if there is a tie. We
also remove the OF position. Because this stands for outfielder, which
is a generalization of three positions–left field, center field, right
field. We also remove pitchers, as they don’t bat in the league that the
Athletics play. Here is the code that does that. Finally, we add their
names and last names so we know who we’re talking about. And here’s a
code that does that ([][********need to understand how the code follow
above instructions********]).
So now, we have a table with our predicted run statistic, some other statistics, the player’s name, their position, and their salary. If we look at the top 10 players based on our run production statistic, you’re going to recognize some names if you’re a baseball fan. Note the very high salaries of these players in the top 10. In fact, we see that players with high metrics have high salaries. We can see that by making a plot we do see some low-cost players with very high metrics. These would be great for our team. Unfortunately, these are likely young players that have not yet been able to negotiate a salary and are not going to be available in 2002. For example, the lowest earner on our top 10 list is Albert Pujols, who was a rookie in 2001. Here’s a plot with players that debuted before 1997. This removes all the young players. We can now search for good deals by looking at players that produce many more runs and others with similar salaries. We can use this table to decide what players to pick and keep our total salary below the $40 million Billy Beane had to work with. End of transcript. Skip to the start.
[][Textbook link]
This video corresponds to the textbook section on continuation of the Moneyball case study. https://rafalab.github.io/dsbook/linear-models.html#case-study-moneyball-continued
[][Code]
# linear regression with two variables
fit <- Teams %>%
filter(yearID %in% 1961:2001) %>%
mutate(BB = BB/G, HR = HR/G, R = R/G) %>%
lm(R ~ BB + HR, data = .)
tidy(fit, conf.int = TRUE)
# regression with BB, singles, doubles, triples, HR
fit <- Teams %>%
filter(yearID %in% 1961:2001) %>%
mutate(BB = BB / G,
singles = (H - X2B - X3B - HR) / G,
doubles = X2B / G,
triples = X3B / G,
HR = HR / G,
R = R / G) %>%
lm(R ~ BB + singles + doubles + triples + HR, data = .)
coefs <- tidy(fit, conf.int = TRUE)
coefs
# predict number of runs for each team in 2002 and plot
Teams %>%
filter(yearID %in% 2002) %>%
mutate(BB = BB/G,
singles = (H-X2B-X3B-HR)/G,
doubles = X2B/G,
triples =X3B/G,
HR=HR/G,
R=R/G) %>%
mutate(R_hat = predict(fit, newdata = .)) %>%
ggplot(aes(R_hat, R, label = teamID)) +
geom_point() +
geom_text(nudge_x=0.1, cex = 2) +
geom_abline()
# average number of team plate appearances per game
pa_per_game <- Batting %>% filter(yearID == 2002) %>%
group_by(teamID) %>%
summarize(pa_per_game = sum(AB+BB)/max(G)) %>%
pull(pa_per_game) %>%
mean
# compute per-plate-appearance rates for players available in 2002 using previous data
players <- Batting %>% filter(yearID %in% 1999:2001) %>%
group_by(playerID) %>%
mutate(PA = BB + AB) %>%
summarize(G = sum(PA)/pa_per_game,
BB = sum(BB)/G,
singles = sum(H-X2B-X3B-HR)/G,
doubles = sum(X2B)/G,
triples = sum(X3B)/G,
HR = sum(HR)/G,
AVG = sum(H)/sum(AB),
PA = sum(PA)) %>%
filter(PA >= 300) %>%
select(-G) %>%
mutate(R_hat = predict(fit, newdata = .))
# plot player-specific predicted runs
qplot(R_hat, data = players, geom = "histogram", binwidth = 0.5, color = I("black"))
# add 2002 salary of each player
players <- Salaries %>%
filter(yearID == 2002) %>%
select(playerID, salary) %>%
right_join(players, by="playerID")
# add defensive position
position_names <- c("G_p","G_c","G_1b","G_2b","G_3b","G_ss","G_lf","G_cf","G_rf")
tmp_tab <- Appearances %>%
filter(yearID == 2002) %>%
group_by(playerID) %>%
summarize_at(position_names, sum) %>%
ungroup()
pos <- tmp_tab %>%
select(position_names) %>%
apply(., 1, which.max)
players <- data_frame(playerID = tmp_tab$playerID, POS = position_names[pos]) %>%
mutate(POS = str_to_upper(str_remove(POS, "G_"))) %>%
filter(POS != "P") %>%
right_join(players, by="playerID") %>%
filter(!is.na(POS) & !is.na(salary))
# add players' first and last names
players <- Master %>%
select(playerID, nameFirst, nameLast, debut) %>%
mutate(debut = as.Date(debut)) %>%
right_join(players, by="playerID")
# top 10 players
players %>% select(nameFirst, nameLast, POS, salary, R_hat) %>%
arrange(desc(R_hat)) %>%
top_n(10)
# players with a higher metric have higher salaries
players %>% ggplot(aes(salary, R_hat, color = POS)) +
geom_point() +
scale_x_log10()
# remake plot without players that debuted after 1998
library(lubridate)
players %>% filter(year(debut) < 1998) %>%
ggplot(aes(salary, R_hat, color = POS)) +
geom_point() +
scale_x_log10()
library(tidyverse)
library(Lahman)
library(broom)
fit <- Teams %>%
filter(yearID %in% 1961:2001) %>%
mutate(BB = BB/G, HR = HR/G, R = R/G) %>%
lm(R ~ BB + HR, data = .)
tidy(fit, conf.int = T)
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.74 0.0824 21.2 7.62e- 83 1.58 1.91
## 2 BB 0.387 0.0270 14.3 1.20e- 42 0.334 0.440
## 3 HR 1.56 0.0490 31.9 1.78e-155 1.47 1.66
library(tidyverse)
library(Lahman)
library(broom)
library(ggplot2)
fit <- Teams %>%
filter(yearID %in% 1961:2001) %>%
mutate(BB = BB/G,
singles = (H-X2B-X3B-HR)/G,
doubles = X2B/G,
triples = X3B/G,
HR = HR/G,
R = R/G) %>%
lm(R ~ BB + singles + doubles + triples + HR, data = .)
tidy(fit, conf.int = T)
## # A tibble: 6 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.77 0.0862 -32.1 4.76e-157 -2.94 -2.60
## 2 BB 0.371 0.0117 31.6 1.87e-153 0.348 0.394
## 3 singles 0.519 0.0127 40.8 8.67e-217 0.494 0.544
## 4 doubles 0.771 0.0226 34.1 8.44e-171 0.727 0.816
## 5 triples 1.24 0.0768 16.1 2.12e- 52 1.09 1.39
## 6 HR 1.44 0.0243 59.3 0 1.40 1.49
Teams %>%
filter(yearID %in% 2002) %>%
mutate(BB = BB/G,
singles = (H-X2B-X3B-HR)/G,
doubles = X2B/G,
triples = X3B/G,
HR = HR/G,
R = R/G) %>%
mutate(R_hat = predict(fit, newdata = .)) %>%
#select(R, R_hat) %>%
ggplot(aes(R_hat, R)) +
geom_point() +
geom_abline() +
geom_text(aes(label = teamID))
#geom_label(aes(label = teamID, size = NULL), nudge_y = 0.1)
so you need to make the per game team rate comparable to the per plat appearance player rate
why you are picking 2002? in this model, we dont have information of 2002
pa_per_game <- Batting %>%
filter(yearID == 2002) %>%
group_by(teamID) %>%
summarize(pa_per_game = sum(AB+BB)/max(G)) %>%
.$pa_per_game %>%
mean
pa_per_game
## [1] 38.7
Batting %>%
filter(yearID == 2002) %>%
group_by(teamID) %>%
summarize(pa_per_game = sum(AB+BB)/max(G))
## # A tibble: 30 × 2
## teamID pa_per_game
## <fct> <dbl>
## 1 ANA 38.9
## 2 ARI 40.2
## 3 ATL 38.3
## 4 BAL 36.9
## 5 BOS 39.6
## 6 CHA 39.6
## 7 CHN 39.7
## 8 CIN 37.4
## 9 CLE 39.5
## 10 COL 38.5
## # … with 20 more rows
#head(Batting)
# Do not run this, its meant to check the PA, why we are choosing 300 as limit
# Need to understand how baseball games runs, why choosing 300?
players <- Batting %>%
filter(yearID %in% 1999:2001) %>%
group_by(playerID) %>%
mutate(PA = BB + AB) %>%
summarize(G = sum(PA)/pa_per_game,
BB = sum(BB)/G,
singles = sum(H-X2B-X3B-HR)/G,
doubles = sum(X2B)/G,
triples = sum(X3B)/G,
HR = sum(HR)/G,
AVG = sum(H)/sum(AB),
PA = sum(PA))
players
## # A tibble: 1,703 × 9
## playerID G BB singles doubles triples HR AVG PA
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 abadan01 0.0258 0 0 0 0 0 0 1
## 2 abbotje01 8.85 3.28 6.21 2.03 0.113 0.565 0.252 343
## 3 abbotji01 0.542 0 3.69 0 0 0 0.0952 21
## 4 abbotku01 12.4 2.41 5.87 1.93 0.241 1.13 0.252 482
## 5 abbotpa01 0.232 0 8.61 4.31 0 0 0.333 9
## 6 abernbr01 8.54 3.16 6.91 1.99 0.117 0.585 0.270 331
## 7 abreubo01 52.3 6.03 5.91 2.39 0.478 1.45 0.313 2025
## 8 acevejo01 0.877 0 3.42 1.14 0 0 0.118 34
## 9 aceveju01 0.645 1.55 1.55 1.55 0 0 0.0833 25
## 10 adamste01 1.19 2.53 0.842 0.842 0 0 0.0465 46
## # … with 1,693 more rows
library(Lahman)
players <- Batting %>%
filter(yearID %in% 1999:2001) %>%
group_by(playerID) %>%
mutate(PA = BB + AB) %>%
summarize(G = sum(PA)/pa_per_game,
BB = sum(BB)/G,
singles = sum(H-X2B-X3B-HR)/G,
doubles = sum(X2B)/G,
triples = sum(X3B)/G,
HR = sum(HR)/G,
AVG = sum(H)/sum(AB),
PA = sum(PA)) %>%
filter(PA >= 300) %>%
select(-G) %>%
mutate(R_hat = predict(fit, newdata = .))
head(players)
## # A tibble: 6 × 9
## playerID BB singles doubles triples HR AVG PA R_hat
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
## 1 abbotje01 3.28 6.21 2.03 0.113 0.565 0.252 343 4.20
## 2 abbotku01 2.41 5.87 1.93 0.241 1.13 0.252 482 4.59
## 3 abernbr01 3.16 6.91 1.99 0.117 0.585 0.270 331 4.52
## 4 abreubo01 6.03 5.91 2.39 0.478 1.45 0.313 2025 7.08
## 5 agbaybe01 4.53 6.31 1.89 0.223 1.30 0.284 1044 5.80
## 6 alexama02 2.26 6.39 1.48 0.492 0.393 0.240 394 3.71
players %>%
ggplot(aes(R_hat)) +
geom_histogram(bins = 20, color = "black")
players <- Salaries %>%
filter(yearID == 2002) %>%
select(playerID, salary) %>%
right_join(players, by = "playerID")
head(players)
## playerID salary BB singles doubles triples HR AVG PA R_hat
## 1 anderga01 5000000 1.63 6.91 2.20 0.134 1.608 0.292 2024 5.61
## 2 eckstda01 280000 2.67 8.31 1.61 0.124 0.248 0.285 625 4.29
## 3 erstada01 6250000 3.25 7.43 1.80 0.225 0.882 0.291 2065 5.24
## 4 fabrejo01 500000 2.57 6.18 1.25 0.278 0.556 0.228 558 3.50
## 5 fullmbr01 4000000 2.42 6.00 2.53 0.134 1.586 0.282 1441 5.65
## 6 gilbe01 400000 2.82 6.47 1.86 0.320 0.897 0.266 605 4.76
OF positions are outer fields players, they must runs a lot
players <- Fielding %>%
filter(yearID == 2002) %>%
filter(!POS %in% c("OF", "P")) %>%
group_by(playerID) %>%
top_n(1, G) %>%
filter(row_number(G) == 1) %>%
ungroup() %>%
select(playerID, POS) %>%
right_join(players, by = "playerID") %>%
filter(!is.na(POS) & !is.na(salary))
head(players)
## # A tibble: 6 × 11
## playerID POS salary BB singles doubles triples HR AVG PA R_hat
## <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
## 1 abernbr01 2B 215000 3.16 6.91 1.99 0.117 0.585 0.270 331 4.52
## 2 alfoned01 3B 6200000 4.81 6.31 2.15 0.0625 1.44 0.293 1860 6.10
## 3 alicelu01 2B 800000 3.55 7.16 1.65 0.387 0.419 0.273 1201 4.62
## 4 alomaro01 2B 7939664 4.73 7.20 2.22 0.331 1.23 0.323 1991 6.62
## 5 alomasa02 C 2500000 1.66 7.39 1.92 0.156 0.884 0.279 745 4.64
## 6 anderma02 2B 400000 2.28 6.93 2.05 0.225 0.546 0.268 1207 4.33
players <- People %>%
select(playerID, nameFirst, nameLast, debut) %>%
right_join(players, by = "playerID")
head(players)
## playerID nameFirst nameLast debut POS salary BB singles doubles
## 1 abernbr01 Brent Abernathy 2001-06-25 2B 215000 3.16 6.91 1.99
## 2 alfoned01 Edgardo Alfonzo 1995-04-26 3B 6200000 4.81 6.31 2.15
## 3 alicelu01 Luis Alicea 1988-04-23 2B 800000 3.55 7.16 1.65
## 4 alomaro01 Roberto Alomar 1988-04-22 2B 7939664 4.73 7.20 2.22
## 5 alomasa02 Sandy Alomar 1988-09-30 C 2500000 1.66 7.39 1.92
## 6 anderma02 Marlon Anderson 1998-09-08 2B 400000 2.28 6.93 2.05
## triples HR AVG PA R_hat
## 1 0.1171 0.585 0.270 331 4.52
## 2 0.0625 1.437 0.293 1860 6.10
## 3 0.3871 0.419 0.273 1201 4.62
## 4 0.3308 1.226 0.323 1991 6.62
## 5 0.1560 0.884 0.279 745 4.64
## 6 0.2247 0.546 0.268 1207 4.33
players %>%
select(nameFirst, nameLast, POS, salary, R_hat) %>%
arrange(desc(R_hat)) %>%
top_n(10)
## Selecting by R_hat
## nameFirst nameLast POS salary R_hat
## 1 Todd Helton 1B 5000000 8.23
## 2 Jason Giambi 1B 10428571 7.99
## 3 Albert Pujols 3B 600000 7.54
## 4 Nomar Garciaparra SS 9000000 7.51
## 5 Jeff Bagwell 1B 11000000 7.48
## 6 Alex Rodriguez SS 22000000 7.44
## 7 Carlos Delgado 1B 19400000 7.37
## 8 Rafael Palmeiro 1B 8712986 7.26
## 9 Mike Piazza C 10571429 7.16
## 10 Jim Thome 1B 8000000 7.16
#Fielding
#head(People)
players %>%
ggplot(aes(salary, R_hat, color = POS)) +
geom_point() +
scale_x_log10()
players %>%
filter(debut < 1998) %>%
ggplot(aes(salary, R_hat, color = POS)) +
geom_point() +
scale_x_log10()
library(tidyverse)
library(broom) # tidy() function
library(Lahman)
Teams_small <- Teams %>%
filter(yearID %in% 1961:2001)
small_dat <- Teams_small %>%
mutate(R_per_game = R/G, BB_per_game = BB/G, HR_per_game = round(HR/G, digits = 1)) %>%
select(HR_per_game, R_per_game, BB_per_game) %>%
filter(HR_per_game > 0.5 & HR_per_game < 1.2)
small_dat %>%
group_by(HR_per_game) %>%
#do(tidy(lm(R_per_game ~ BB_per_game, data = .))) ### Can we apply filter() after this? Say we want Intercept only
do(tidy(lm(R_per_game ~ BB_per_game, data = .))) %>%
filter(., term != "BB_per_game")
## # A tibble: 6 × 6
## # Groups: HR_per_game [6]
## HR_per_game term estimate std.error statistic p.value
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 0.6 (Intercept) 2.53 0.305 8.32 2.43e-13
## 2 0.7 (Intercept) 3.21 0.225 14.3 1.49e-30
## 3 0.8 (Intercept) 3.07 0.213 14.4 5.40e-31
## 4 0.9 (Intercept) 3.54 0.251 14.1 8.77e-29
## 5 1 (Intercept) 2.88 0.256 11.3 6.62e-21
## 6 1.1 (Intercept) 3.21 0.300 10.7 6.46e-17
# \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
small_dat %>%
group_by(HR_per_game) %>%
do(data.frame(name = names(lm(R_per_game ~ BB_per_game, data = .)$coefficient[1]),
value = tidy(lm(R_per_game ~ BB_per_game, data = .))$estimate[1],
se = tidy(lm(R_per_game ~ BB_per_game, data = .))$std.error[1]))
## # A tibble: 6 × 4
## # Groups: HR_per_game [6]
## HR_per_game name value se
## <dbl> <chr> <dbl> <dbl>
## 1 0.6 (Intercept) 2.53 0.305
## 2 0.7 (Intercept) 3.21 0.225
## 3 0.8 (Intercept) 3.07 0.213
## 4 0.9 (Intercept) 3.54 0.251
## 5 1 (Intercept) 2.88 0.256
## 6 1.1 (Intercept) 3.21 0.300
# //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
par(mfrow=c(2,2))
plot(lm(R_per_game ~ BB_per_game, data = small_dat))
summary(lm(R_per_game ~ BB_per_game, data = small_dat))
##
## Call:
## lm(formula = R_per_game ~ BB_per_game, data = small_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.186 -0.281 -0.001 0.307 1.439
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.5651 0.1193 21.5 <2e-16 ***
## BB_per_game 0.5312 0.0362 14.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.427 on 799 degrees of freedom
## Multiple R-squared: 0.212, Adjusted R-squared: 0.211
## F-statistic: 215 on 1 and 799 DF, p-value: <2e-16
A way to actually pick the players for the team can be done using what computer scientists call [][linear programming]. Although we don’t go into this topic in detail in this course, we include the code anyway:
library(reshape2)
library(lpSolve)
library(tidyverse)
players <- players %>%
filter(debut <= "1997-01-01" & debut > "1988-01-01")
constraint_matrix <- acast(players, POS ~ playerID, fun.aggregate = length)
npos <- nrow(constraint_matrix)
constraint_matrix <- rbind(constraint_matrix, salary = players$salary)
constraint_dir <- c(rep("==", npos), "<=")
constraint_limit <- c(rep(1, npos), 50*10^6)
lp_solution <- lp("max", players$R_hat,
constraint_matrix, constraint_dir, constraint_limit,
all.int = TRUE)
# This algorithm chooses these 9 players:
our_team <- players %>%
filter(lp_solution$solution == 1) %>%
arrange(desc(R_hat))
our_team %>%
select(nameFirst, nameLast, POS, salary, R_hat)
# We note that these players all have above average BB and HR rates while the same is not true for singles.
my_scale <- function(x) (x - median(x))/mad(x)
players %>%
mutate(BB = my_scale(BB),
singles = my_scale(singles),
doubles = my_scale(doubles),
triples = my_scale(triples),
HR = my_scale(HR),
AVG = my_scale(AVG),
R_hat = my_scale(R_hat)) %>%
filter(playerID %in% our_team$playerID) %>%
select(nameFirst, nameLast, BB, singles, doubles, triples, HR, AVG, R_hat) %>%
arrange(desc(R_hat))
Since the 1980s sabermetricians have used a summary statistic different from batting average to evaluate players. They realized walks were important, and that doubles, triples, and home runs should be weighted much more than singles, and proposed the following metric. They call this on-base-percentage plus slugging percentage, or OPS. Today, this statistic has caught on, and you see it in ESPN and other sports networks. Although the sabermetricians are probably not using regression, this metric is impressively close to what one gets with regression to the summary statistic that we created. Here is the plot. They’re very correlated.
[][Key point]
The on-base-percentage plus slugging percentage (OPS) metric is:
The equasion you see in below image
[][(This course is not long, but its hell important, do read it
again and again. I think the instructor want to say correlation is not
causation potentially here, as not all the 2014 dots are
under save level of 2013 dot )]
Wikipedia defines the [][sophomore slump] in the following way. A sophomore slump or sophomore jinx or sophomore jitters refers to an instance in which a second, or sophomore, effort fails to live up to the standard of the first effort. It is commonly used to refer to the apathy of students–second year of high school, college, university–the performance of athletes– second season of play–singers/bands– second album– television shows– second season–and movies– sequels or prequels. We hear about the sophomore slump often in Major League Baseball. This is because in Major League Baseball, the Rookie of the Year–this is an award that’s given to the first year player that is judged to have performed the best–usually does not perform as well during their second year.
Therefore they call this the sophomore slump. Know, for example, that in a recent Fox Sports article they asked, will MLB’s tremendous rookie class of 2015 suffer a sophomore slump. Now does the data confirm the existence of a sophomore slump? ([][If you are Data Scientist, better preparing such questions, CEO and CFO will definitelly asking you such questions, and when you are not 100% prepared]) Let’s take a look and examine the data for batting averages to see if the observation holds true.
The data is available in the Lehman Library, but we have to do some work to create a table with the statistics for all the rookies of the year. Let’s go through them (Harvard instituters helps you go through different questions and solving them one by one, do learnt something, dont waste they kindness and time). First, we create a table with player ID, their names and their most played position, using this code. Now we will create a table with only the Rookie of the Year Award winners and add their batting statistics. We’re going to filter out pitchers since pitchers are not given awards for batting. And we’re going to focus on offense. Specifically, we’ll focus on batting average since it is the summary that most pundits talk about when discussing the sophomore slump.
So we write this piece of code to do this. Now we’ll keep only the rookie and sophomore seasons and remove players that did not play a sophomore season. And remember, now we’re only looking at players that won the Rookie of the Year Award. This code achieves what we want. Finally, we will use the [][**spread() function*] (do you remember how we wrote a function to do such job? and do you recall how we perform such task under Pandas package, need to restudy such topic) to have one column for the rookie and another column for the sophomore years’ batting averages. For that we use this simple line of code.
Now we can see the top performers in their first year. These are the Rookie of the Year Award winners. And we’re showing their rookie season batting average and their sophomore season batting average. Look closely and you will see the sophomore slump (what you saw is likely not true, say here the instructor applied > comparison without a fixing boundary setting here, which is absolutely not properly, and maybe the fittling are not appropriate as well, go through all the code and doing your own version later). It definitely appears to be real. In fact, the proportion of players that have a lower batting average their sophomore years is 68%. So is it jitters? Is it a jinx?
[][To answer this question, let’s turn our attention to all players] (comparing if this is full player level phenomenal). We’re going to look at the 2013 season and 2014 season. And we’re going to look at players that batted at least 130 times. This is a minimum needed to win the Rookie of the Year. We’re going to perform a similar operation as we did before to construct this data set. Here is the code. Now let’s look at the top performers of 2013 and then look at their performance in 2014. [][Note that the same pattern arises when we look at the top performers]. Batting averages go down for the top performers. But these are not rookies. So this can’t be explained with a sophomore slump.
Also know what happens to the worst performers of 2013. Here they are. Their batting averages go up in their second season in 2014. Is this some sort of reverse sophomore slump? It is not. There is no such thing as a sophomore slump. This is all explained with a simple statistical fact. [][*The correlation of performance in two separate years is high but not perfect] (what will happened if we saw a perfect correlation here? all dot dropped on the regression line?). Here is the data for 2013 performance and 2014 performance. You can see it’s correlated. But it’s not perfectly correlated. The correlation is 0.46. The data look very much like a [][bivariate normal distribution] (american football shape, do you remember), which means that if we were to predict the 2014 batting average, let’s call it y, for any given player that had a 2013 batting average of x, we would use the regression equation, which would be this.
Because a correlation is not perfect, regression tells us that on average, we expect high performers from 2013 to do a little bit worse in 2014. This regression to the mean. It’s not a jinx. It’s just due to chance. The rookies of the year are selected from the top values of x. So it is expected that their y will regress to the mean.
[][Textbook link]
This video corresponds to the textbook section on regression fallacy. https://rafalab.github.io/dsbook/linear-models.html#the-regression-fallacy
[][Key points]
Regression can bring about errors in reasoning, especially when interpreting individual observations.
The example showed in the video demonstrates that the "sophomore slump" observed in the data is caused by regressing to the mean.
Code The code to create a table with player ID, their names, and their most played position:
library(Lahman) playerInfo <- Fielding %>% group_by(playerID) %>% arrange(desc(G)) %>% slice(1) %>% ungroup %>% left_join(Master, by=“playerID”) %>% select(playerID, nameFirst, nameLast, POS)
The code to create a table with only the ROY award winners and add their batting statistics:
ROY <- AwardsPlayers %>% filter(awardID == “Rookie of the Year”) %>% left_join(playerInfo, by=“playerID”) %>% rename(rookie_year = yearID) %>% right_join(Batting, by=“playerID”) %>% mutate(AVG = H/AB) %>% filter(POS != “P”)
The code to keep only the rookie and sophomore seasons and remove players who did not play sophomore seasons:
ROY <- ROY %>% filter(yearID == rookie_year | yearID == rookie_year+1) %>% group_by(playerID) %>% mutate(rookie = ifelse(yearID == min(yearID), “rookie”, “sophomore”)) %>% filter(n() == 2) %>% ungroup %>% select(playerID, rookie_year, rookie, nameFirst, nameLast, AVG)
The code to use the spread function to have one column for the rookie and sophomore years batting averages:
ROY <- ROY %>% spread(rookie, AVG) %>% arrange(desc(rookie)) ROY
#> # A tibble: 99 x 6 #> playerID rookie_year nameFirst
nameLast rookie sophomore #>
The code to calculate the proportion of players who have a lower batting average their sophomore year:
mean(ROY\(sophomore - ROY\)rookie <= 0)
#> [1] 0.677
The code to do the similar analysis on all players that played the 2013 and 2014 seasons and batted more than 130 times (minimum to win Rookie of the Year):
two_years <- Batting %>% filter(yearID %in% 2013:2014) %>%
group_by(playerID, yearID) %>% filter(sum(AB) >= 130) %>%
summarize(AVG = sum(H)/sum(AB)) %>% ungroup %>% spread(yearID,
AVG) %>% filter(!is.na(2013) &
!is.na(2014)) %>% left_join(playerInfo, by=“playerID”)
%>% filter(POS!=“P”) %>% select(-POS) %>%
arrange(desc(2013)) %>% select(nameFirst, nameLast,
2013, 2014) two_years
#> # A tibble: 312 x 4 #> nameFirst nameLast 2013
2014 #>
The code to see what happens to the worst performers of 2013:
arrange(two_years, 2013)
#> # A tibble: 312 x 4 #> nameFirst nameLast 2013
2014 #>
#> 4 Melvin Upton 0.184 0.208 #> 5 Adam Rosales 0.190 0.262 #>
6 Aaron Hicks 0.192 0.215 #> # … with 306 more rows
The code to see the correlation for performance in two separate years:
qplot(2013, 2014, data = two_years)
summarize(two_years, cor(2013,2014)) #>
# A tibble: 1 x 1 #> cor(\2013`, `2014`)` #>
library(Lahman)
playerinfo <- Fielding %>%
group_by(playerID) %>%
arrange(desc(G)) %>%
slice(1) %>%
#ungroup() %>%
left_join(People, by="playerID") %>%
select(playerID, nameFirst, nameLast, POS)
head(playerinfo)
## # A tibble: 6 × 4
## # Groups: playerID [6]
## playerID nameFirst nameLast POS
## <chr> <chr> <chr> <chr>
## 1 aardsda01 David Aardsma P
## 2 aaronha01 Hank Aaron OF
## 3 aaronto01 Tommie Aaron 1B
## 4 aasedo01 Don Aase P
## 5 abadan01 Andy Abad 1B
## 6 abadfe01 Fernando Abad P
ROY <- AwardsPlayers %>%
filter(awardID == "Rookie of the Year") %>%
left_join(playerinfo, by = "playerID") %>%
rename(rookie_year = yearID) %>%
right_join(Batting, by = "playerID") %>%
mutate(AVG = H/AB) %>%
filter(POS != "P")
head(ROY)
## playerID awardID rookie_year lgID.x tie notes nameFirst nameLast
## 1 robinja02 Rookie of the Year 1947 ML <NA> <NA> Jackie Robinson
## 2 robinja02 Rookie of the Year 1947 ML <NA> <NA> Jackie Robinson
## 3 robinja02 Rookie of the Year 1947 ML <NA> <NA> Jackie Robinson
## 4 robinja02 Rookie of the Year 1947 ML <NA> <NA> Jackie Robinson
## 5 robinja02 Rookie of the Year 1947 ML <NA> <NA> Jackie Robinson
## 6 robinja02 Rookie of the Year 1947 ML <NA> <NA> Jackie Robinson
## POS yearID stint teamID lgID.y G AB R H X2B X3B HR RBI SB CS BB SO
## 1 2B 1947 1 BRO NL 151 590 125 175 31 5 12 48 29 NA 74 36
## 2 2B 1948 1 BRO NL 147 574 108 170 38 8 12 85 22 NA 57 37
## 3 2B 1949 1 BRO NL 156 593 122 203 38 12 16 124 37 NA 86 27
## 4 2B 1950 1 BRO NL 144 518 99 170 39 4 14 81 12 NA 80 24
## 5 2B 1951 1 BRO NL 153 548 106 185 33 7 19 88 25 8 79 27
## 6 2B 1952 1 BRO NL 149 510 104 157 17 3 19 75 24 7 106 40
## IBB HBP SH SF GIDP AVG
## 1 NA 9 28 NA 5 0.297
## 2 NA 7 8 NA 7 0.296
## 3 NA 8 17 NA 22 0.342
## 4 NA 5 10 NA 11 0.328
## 5 NA 9 6 NA 10 0.338
## 6 NA 14 6 NA 16 0.308
ROY <- ROY %>%
filter(yearID == rookie_year | yearID == rookie_year+1) %>%
group_by(playerID) %>%
mutate(rookie = ifelse(yearID == min(yearID), "Rookie", "Sophomore")) %>%
filter(n() == 2) %>%
ungroup() %>%
select(playerID, rookie_year, rookie, nameFirst, nameLast, AVG)
ROY
## # A tibble: 212 × 6
## playerID rookie_year rookie nameFirst nameLast AVG
## <chr> <int> <chr> <chr> <chr> <dbl>
## 1 robinja02 1947 Rookie Jackie Robinson 0.297
## 2 robinja02 1947 Sophomore Jackie Robinson 0.296
## 3 darkal01 1948 Rookie Al Dark 0.322
## 4 darkal01 1948 Sophomore Al Dark 0.276
## 5 sievero01 1949 Rookie Roy Sievers 0.306
## 6 sievero01 1949 Sophomore Roy Sievers 0.238
## 7 dropowa01 1950 Rookie Walt Dropo 0.322
## 8 dropowa01 1950 Sophomore Walt Dropo 0.239
## 9 jethrsa01 1950 Rookie Sam Jethroe 0.273
## 10 jethrsa01 1950 Sophomore Sam Jethroe 0.280
## # … with 202 more rows
# Now Think How to make this happens, and do your own test, and master it ############################################################
library(tidyr)
ROY <- ROY %>%
spread(rookie, AVG) %>%
arrange(desc(Rookie)) # Because we've pulled value as variables now
ROY
## # A tibble: 106 × 6
## playerID rookie_year nameFirst nameLast Rookie Sophomore
## <chr> <int> <chr> <chr> <dbl> <dbl>
## 1 mccovwi01 1959 Willie McCovey 0.354 0.238
## 2 suzukic01 2001 Ichiro Suzuki 0.350 0.321
## 3 bumbral01 1973 Al Bumbry 0.337 0.233
## 4 lynnfr01 1975 Fred Lynn 0.331 0.314
## 5 pujolal01 2001 Albert Pujols 0.329 0.314
## 6 troutmi01 2012 Mike Trout 0.326 0.323
## 7 braunry02 2007 Ryan Braun 0.324 0.285
## 8 olivato01 1964 Tony Oliva 0.323 0.321
## 9 hargrmi01 1974 Mike Hargrove 0.323 0.303
## 10 darkal01 1948 Al Dark 0.322 0.276
## # … with 96 more rows
ROY <- ROY %>%
mutate(flag = ifelse(Sophomore-Rookie>=0.005, 1, 0))
sum(ROY$flag)/length(ROY$Rookie)
## [1] 0.236
two_years <- Batting %>%
filter(yearID %in% 2013:2014) %>%
group_by(playerID, yearID) %>%
filter(sum(AB)>130) %>%
summarize(AVG=sum(H)/sum(AB)) %>%
ungroup %>%
spread(yearID, AVG) %>%
filter(!is.na(`2013`) & !is.na(`2014`)) %>% # using ` to referring column name is kind of like SQL
left_join(playerinfo, by="playerID") %>%
filter(POS != "P") %>%
select(-POS) %>%
arrange(desc('2013')) %>%
select(nameFirst, nameLast, `2013`, `2014`)
## `summarise()` has grouped output by 'playerID'. You can override using the
## `.groups` argument.
two_years
## # A tibble: 311 × 4
## nameFirst nameLast `2013` `2014`
## <chr> <chr> <dbl> <dbl>
## 1 Dustin Ackley 0.253 0.245
## 2 Matt Adams 0.284 0.288
## 3 Yonder Alonso 0.281 0.240
## 4 Jose Altuve 0.283 0.341
## 5 Pedro Alvarez 0.233 0.231
## 6 Alexi Amarista 0.236 0.239
## 7 Elvis Andrus 0.271 0.263
## 8 Nori Aoki 0.286 0.285
## 9 Oswaldo Arcia 0.251 0.231
## 10 Nolan Arenado 0.267 0.287
## # … with 301 more rows
two_years %>%
ggplot(aes(`2013`, `2014`)) +
geom_point(alpha=0.3)
summarise(two_years, cor(`2013`, `2014`))
## # A tibble: 1 × 1
## `cor(\`2013\`, \`2014\`)`
## <dbl>
## 1 0.468
# |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||\
# Weve see a high "correlation" (share we call it that name?) here, but there are not
+++ be curiosity +++, does the data confirm the existance of a sophomore slump ???
since batting average is what people talked about when discussing sophomore slump
we see same patterns here batting averages slump in these players, but ther are not reworded as rookies of the year.png
correlation for performance in 2 separate years is high but not perfect.png
Up until now, all our linear regression examples have been applied to two or more random variables. We assume the pairs are bivariate normal and use this to motivate a linear model. This approach covers most of real life examples where linear regression is used. [][The other major application comes from measurement error models]. In these applications, it is common to have a nonrandom covariates, such as time. And randomness is introduced from measurement error, rather than sampling or natural variability. To understand these models, we’re going to use a motivation example related to physics.
Imagine you are Galileo in the 16th century trying to describe the velocity of a falling object. An assistant climbs the Tower of Pisa and drops a ball. While several other assistants record the position at different times. The falling object data set contains an example of what that data would look like. The assistant hands the data to Galileo and this is what he sees. He uses ggplot to make a plot. Here we see the distance in meters that has dropped on the y-axis and time on the x-axis. Galileo does not know the exact equation, but from data exploration, by looking at the plot, he deduces that the position should follow a parabola, which we can write like this. The data does not fall exactly on a [][parabola], but Galileo knows that this is due to measurement error.
His helpers make mistakes when measuring the distance the ball has fallen. To account for this, we write this model. Here, y represents the distance the ball is dropped in meters. Xi represents time in seconds. And epsilon represents measurement error. [][The measurement error is assumed to be random, independent from each other and having the same distribution from each eye]. We also assume that there is no bias, which means that the expected value of epsilon is 0. Note that this is a linear model because it is a linear combination of known quantities. X and x squared are known and unknown parameters, the betas. Unlike our previous example, the x’s are fixed quantities. This is just time. We’re not conditioning. Now to pose a new physical theory and start making predictions about other falling objects, Galileo needs actual numbers, rather than the unknown parameters.
The least squares estimates seem like a reasonable approach. So how do we find the least squares estimates? [][Note that the LSE calculations do not require the errors to be approximately normal]. The lm( ) function will find the betas that minimize the residual sum of squares, which is what we want. So we use this code to obtain our estimated parameters. To check if the estimated parabola fits the data, the broom function augment( ) lets us do this easily. Using this code, we can make the following plot. Note that the predicted values go right through the points. Now, thanks to my high school physics teacher, I know that the equation for the trajectory of a falling object is the following. With h0 and v0, the starting height and starting velocity respectively. The data we use follow this equation and added measurement error to simulate and observations.
Dropping the ball, that means the starting velocity is 0 because we start just by dropping it from the Tower of Pisa, which has a height of about 56.67 meters. These known quantities are consistent with the parameters that we estimated, which we can see using the tidy function. Here they are. The Tower of Pisa height is within the confidence interval for beta 0. The initial velocity of 0 is in the confidence interval for beta 1. [][Note that the p value is larger than 0.05, which means we wouldn’t reject the hypothesis that the starting velocity is 0]. And finally, the acceleration constant is in the confidence intervals for negative 2 times beta 2.
measurement error
we also assume there is no bias, which means the expected value of epsilion is 0.png
[][the LSE seems like a reasonal approach, NOTE the LSE
calculation does not require the error to be approximately normal]
[][the lm function will find the betas that minimize the residual
sum of squares.png]
library(ggplot2)
library(Lahman)
library(tidyverse)
library(dslabs)
falling_object <- rfalling_object()
falling_object %>%
ggplot(aes(time, observed_distance)) +
geom_point() +
ylab("Distance in meters") +
xlab("Time in seconds")
library(broom)
head(falling_object)
## time distance observed_distance
## 1 0.00 55.9 55.3
## 2 0.25 55.6 54.2
## 3 0.50 54.6 54.6
## 4 0.75 53.1 52.7
## 5 1.00 51.0 50.4
## 6 1.25 48.2 49.2
fit <- falling_object %>%
mutate(time_sq = time^2, y = observed_distance) %>%
lm(y ~ time + time_sq, data =.)
tidy(fit)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 55.0 0.434 127. 9.15e-19
## 2 time 0.888 0.620 1.43 1.80e- 1
## 3 time_sq -5.08 0.184 -27.7 1.61e-11
augment(fit) %>%
ggplot() +
geom_point(aes(time, y)) +
geom_line(aes(time, .fitted))
Comprehension Check due Jun 12, 2022 00:29 AWST ## Question 1 1/1 point (graded) What is the final linear model (in the video “Building a Better Offensive Metric for Baseball”) we used to predict runs scored per game?
lm(R ~ BB + HR)
lm(HR ~ BB + singles + doubles + triples)
lm(R ~ BB + singles + doubles + triples + HR)
lm(R ~ singles + doubles + triples + HR)
correct Answer Correct: Correct.
Explanation lm(R ~ BB + singles + doubles + triples + HR) is the only one of the models above that predicts runs scored based on all of the following: BBs, singles, doubles, triples, and HRs. You have used 1 of 2 attempts Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button. Answers are displayed within the problem
1/1 point (graded)
We want to estimate runs per game scored by individual players, not just by teams. What summary metric do we calculate to help estimate this?
Look at the code from the video “Building a Metter Offensive Metric for Baseball” for a hint:
pa_per_game <- Batting %>%
filter(yearID == 2002) %>% group_by(teamID) %>% summarize(pa_per_game = sum(AB+BB)/max(G)) %>% .$pa_per_game %>% mean
The summary metric used is: pa_per_game: the mean number of plate appearances per team per game for each team pa_per_game: the mean number of plate appearances per game for each player pa_per_game: the number of plate appearances per team per game, averaged across all teams correct
Explanation pa_per_game is the number of plate appearances per team per game averaged across all teams. We initially calculated the pa_per_game grouped by teams but then took the means across all teams to get one summary metric. You have used 1 of 1 attempt Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button. Answers are displayed within the problem
1/1 point (graded)
Imagine you have two teams. Team A is comprised of batters who, on average, get two bases on balls, four singles, one double, no triples, and one home run. Team B is comprised of batters who, on average, get one base on balls, six singles, two doubles, one triple, and no home runs. Which team scores more runs, as predicted by our model? Team A Team B Tie Impossible to know correct Answer Correct: Correct.
Explanation
By using the coefficients from the linear model to predict the number of runs scored by each team, you find that Team B is expected to score more runs on average. You have used 1 of 2 attempts Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button. Answers are displayed within the problem
1/1 point (graded) The on-base-percentage plus slugging percentage (OPS) metric gives the most weight to: Singles Doubles Triples Home Runs correct Answer Correct: Correct.
Explanation
By looking at the equation for OPS, you can tell that the OPS metric weights home runs most heavily.
You have used 1 of 2 attempts Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button. Answers are displayed within the problem
1/1 point (graded) What statistical concept properly explains the “sophomore slump”? Regression to the mean Law of averages Normal distribution correct
Explanation
Regression to the mean is what explains the sophomore slump. The correlation for performance in two separate years is high but not perfect, so high performers will tend to perform slightly worse in the following year (and low performers will tend to perform slightly better in the following year). You have used 1 of 1 attempt Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button. Answers are displayed within the problem
1/1 point (graded) In our model of time vs. observed_distance in the video “Measurement Error Models”, the randomness of our data was due to: sampling natural variability measurement error correct
Explanation
Measurement error models look at applications where randomness is introduced from measurement error instead of sampling or natural variability. You have used 1 of 1 attempt Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button. Answers are displayed within the problem
1/1 point (graded) Which of the following are important assumptions about the measurement errors in the experiment presented in the video “Measurement Error Models”?
Select ALL that apply. The measurement error is random The measurement error is independent The measurement error has the same distribution for each time correct
Explanation
In this model, we asumed that the measurement errors were random, independent from each other, and had the same distribution for each time i.
We also assumed that there was no bias, which means that
. You have used 1 of 2 attempts Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button. Answers are displayed within the problem
0/1 point (graded) Which of the following scenarios would violate an assumption of our measurement error model? The experiment was conducted on the moon. There was one position where it was particularly difficult to see the dropped ball. correct The experiment was only repeated 10 times, not 100 times. incorrect
Explanation
If there were one position where it was particularly difficult to see the dropped ball, that would violate the assumption of randomness. If the experiment were conducted on the moon, that would simply predict a different gravitational constant. Repeating the experiment 10 instead of 100 times would not matter because we do not need a large sample for our assumptions to be valid in this model. You have used 1 of 1 attempt Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button. Answers are displayed within the problem
library(tidyverse)
library(Lahman)
library(broom)
library(ggplot2)
fit <- Teams %>%
filter(yearID %in% 1961:2001) %>%
mutate(BB = BB/G,
singles = (H-X2B-X3B-HR)/G,
doubles = X2B/G,
triples = X3B/G,
HR = HR/G,
R = R/G) %>%
lm(R ~ BB + singles + doubles + triples + HR, data = .)
tidy(fit, conf.int = T)
## # A tibble: 6 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.77 0.0862 -32.1 4.76e-157 -2.94 -2.60
## 2 BB 0.371 0.0117 31.6 1.87e-153 0.348 0.394
## 3 singles 0.519 0.0127 40.8 8.67e-217 0.494 0.544
## 4 doubles 0.771 0.0226 34.1 8.44e-171 0.727 0.816
## 5 triples 1.24 0.0768 16.1 2.12e- 52 1.09 1.39
## 6 HR 1.44 0.0243 59.3 0 1.40 1.49
team_a <- data.frame(BB=2, singles=4, doubles=1, triples=0, HR=1)
team_b <- data.frame(BB=1, singles=6, doubles=2, triples=1, HR=0)
# https://www.guru99.com/r-data-frames.html
predict(fit, newdata = team_a)
## 1
## 2.27
predict(fit, newdata = team_b)
## 1
## 3.5
Comprehension Check due Jun 12, 2022 00:29 AWST Completed
Question 9 has two parts. Use the information below to answer both parts.
Use the Teams data frame from the Lahman package. Fit a multivariate linear regression model to obtain the effects of BB and HR on Runs (R) in 1971. Use the tidy() function in the broom package to obtain the results in a data frame.
2.0/2.0 points (graded) What is the estimate for the effect of BB on runs? correct
0.414 Loading
Explanation
The estimate can be calculated using the following code:
library(Lahman)
library(broom) Teams %>% filter(yearID == 1971) %>% lm(R ~ BB + HR, data = .) %>% tidy() %>% filter(term == “BB”) %>% pull(estimate)
What is the estimate for the effect of HR on runs? correct
1.30 Loading
Explanation
The estimate can be calculated using the following code:
Teams %>%
filter(yearID == 1971) %>%
lm(R ~ BB + HR, data = .) %>%
tidy() %>%
filter(term == "HR") %>%
pull(estimate)
You have used 1 of 10 attempts Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button. Answers are displayed within the problem
1.0/1.0 point (graded)
Interpret the p-values for the estimates using a cutoff of 0.05 and considering the year 1971 as a sample to make inference on the population of all baseball games across years. Which of the following is the correct interpretation? Both BB and HR have a nonzero effect on runs. HR has a significant effect on runs, but the evidence is not strong enough to suggest BB also does. BB has a significant effect on runs, but the evidence is not strong enough to suggest HR also does. Neither BB nor HR have a statistically significant effect on runs. correct
Explanation
The p-value for HR is less than 0.05, but the p-value of BB is greater than 0.05 (0.06), so the evidence is not strong enough to suggest that BB has a significant effect on runs at a p-value cutoff of 0.05. You have used 1 of 2 attempts Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button. Answers are displayed within the problem
1.0/1.0 point (graded)
Repeat the above exercise to find the effects of BB and HR on runs (R) for every year from 1961 to 2018 using do() and the broom package.
Make a scatterplot of the estimate for the effect of BB on runs over time and add a trend line with confidence intervals.
Fill in the blank to complete the statement:
The effect of BB on runs has correct
increased
over time.
Explanation
The scatterplot with trendline can be made using the following code:
res <- Teams %>%
filter(yearID %in% 1961:2018) %>%
group_by(yearID) %>%
do(tidy(lm(R ~ BB + HR, data = .))) %>%
ungroup()
res %>% filter(term == “BB”) %>% ggplot(aes(yearID, estimate)) + geom_point() + geom_smooth(method = “lm”)
You have used 1 of 1 attempt Some problems have options such as save, reset, hints, or show answer. These options follow the Submit button. Answers are displayed within the problem
2.0/2.0 points (graded)
Fit a linear model on the results from Question 10 to determine the effect of year on the impact of BB. For each additional year, by what value does the impact of BB on runs change? correct
0.00355 Loading
Explanation
The value can be calculated using the following code:
res %>%
filter(term == "BB") %>%
lm(estimate ~ yearID, data = .) %>%
tidy() %>%
filter(term == "yearID") %>%
pull(estimate)
What is the p-value for this effect? correct
0.00807 Loading
Explanation
The p-value can be calculated using the following code:
res %>%
filter(term == "BB") %>%
lm(estimate ~ yearID, data = .) %>%
tidy() %>%
filter(term == "yearID") %>%
pull(p.value)
You have used 2 of 10 attempts Some
library(Lahman)
library(broom)
n_teams1971 <- Teams %>%
filter(yearID == 1971) %>%
mutate(R = R/G, BB = BB/G, HR = HR/G) %>%
lm(R ~ BB + HR, data = .)
tidy(n_teams1971, conf.int = T)
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.60 0.665 2.40 0.0256 0.215 2.98
## 2 BB 0.412 0.203 2.03 0.0552 -0.0101 0.835
## 3 HR 1.29 0.427 3.03 0.00639 0.406 2.18
library(Lahman)
library(broom)
n_teams1971 <- Teams %>%
filter(yearID %in% 1961:2018) %>%
mutate(R_per_game = R/G, BB_per_game = BB/G, HR_per_game = HR/G) %>%
group_by(yearID) %>%
do(data.frame(term = names(lm(R_per_game ~ BB_per_game + HR_per_game, data = .)$coefficients),
estimate = lm(R_per_game ~ BB_per_game + HR_per_game, data = .)$coefficients)) %>%
filter(term == "BB_per_game")
n_teams1971
## # A tibble: 58 × 3
## # Groups: yearID [58]
## yearID term estimate
## <int> <chr> <dbl>
## 1 1961 BB_per_game 0.0845
## 2 1962 BB_per_game 0.142
## 3 1963 BB_per_game 0.339
## 4 1964 BB_per_game -0.105
## 5 1965 BB_per_game 0.235
## 6 1966 BB_per_game 0.104
## 7 1967 BB_per_game 0.0660
## 8 1968 BB_per_game -0.199
## 9 1969 BB_per_game 0.153
## 10 1970 BB_per_game 0.239
## # … with 48 more rows
m_teams1971 <- Teams %>%
filter(yearID %in% 1961:2018) %>%
group_by(yearID) %>%
#mutate(R = R/G, BB = BB/G, HR = HR/G) %>%
do(tidy(lm(R ~ BB + HR, data = .))) %>%
filter(term == "BB")
m_teams1971
## # A tibble: 58 × 6
## # Groups: yearID [58]
## yearID term estimate std.error statistic p.value
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1961 BB 0.205 0.156 1.32 0.208
## 2 1962 BB 0.179 0.283 0.632 0.536
## 3 1963 BB 0.346 0.242 1.43 0.171
## 4 1964 BB -0.102 0.302 -0.339 0.739
## 5 1965 BB 0.221 0.255 0.869 0.397
## 6 1966 BB 0.125 0.214 0.585 0.567
## 7 1967 BB 0.0730 0.223 0.327 0.747
## 8 1968 BB -0.198 0.205 -0.966 0.348
## 9 1969 BB 0.153 0.163 0.938 0.359
## 10 1970 BB 0.238 0.156 1.52 0.142
## # … with 48 more rows
m_teams1971 %>%
ggplot(aes(yearID, estimate)) +
geom_point() +
geom_smooth(method = "lm") # remember to use this one, not hard reading scatter plot dots
## `geom_smooth()` using formula 'y ~ x'
m_teams1971 %>%
lm(estimate ~ yearID, data = .) %>%
tidy
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -6.75 2.57 -2.62 0.0112
## 2 yearID 0.00355 0.00129 2.75 0.00807
res <- Teams %>%
filter(yearID %in% 1961:2018) %>%
group_by(yearID) %>%
do(tidy(lm(R ~ BB + HR, data = .))) %>%
ungroup()
res
## # A tibble: 174 × 6
## yearID term estimate std.error statistic p.value
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1961 (Intercept) 455. 89.8 5.07 0.000139
## 2 1961 BB 0.205 0.156 1.32 0.208
## 3 1961 HR 0.999 0.300 3.33 0.00455
## 4 1962 (Intercept) 448. 149. 3.01 0.00789
## 5 1962 BB 0.179 0.283 0.632 0.536
## 6 1962 HR 1.18 0.504 2.34 0.0316
## 7 1963 (Intercept) 281. 118. 2.38 0.0293
## 8 1963 BB 0.346 0.242 1.43 0.171
## 9 1963 HR 1.42 0.299 4.75 0.000186
## 10 1964 (Intercept) 512. 113. 4.54 0.000293
## # … with 164 more rows
res %>%
filter(term == "BB") %>%
ggplot(aes(yearID, estimate)) +
geom_point() +
geom_smooth(method = "lm") # remember to use this one, not hard reading scatter plot dots
## `geom_smooth()` using formula 'y ~ x'
Question 11 explanation,
discussion posted 5 days ago by sonicksuri
Hi, I am not able to understand use of estimate ~ BB in question 11, incase we want to see effect of year on BB, why are we getting different answer with lm(BB ~ year) ? This post is visible to everyone. 1 response
nickbirk (Staff)
a day ago
Hello, I think the interpretation of the question can be a bit tricky, so I will expand on its meaning. The question reads:
“Fit a linear model on the results from Question 10 to determine the effect of year on the impact of BB.”
Breaking this into separate parts, “Fit a linear model on the results from Question 10” means we are re-using our previous model, which we can notate as E(Runs|HR and BB) = (Intercept) + B1(HR) + B2(BB), where B1 and B2 represent the estimated coefficients for HR and BB, respectively. Recall that for question 10, we also fit a separate model for each year from 1961 to 2018. That is, we estimate 58 different models for each year, and therefore 58 different intercepts, 58 different estimated value of B1, and 58 estimated values of B2. These 58 models are the “Results from question 10”.
Next, we have “determine the effect of year on the impact of BB”. That is, we want to see how the year variable impacts the estimated impact of BB, AKA our estimated coefficient B2. That is, we want to create a linear model where our 58 values of B2 are predicted by the variable yearID. When we filter our results from question 10 to just include the BB term, this is where the formula estimate ~ yearID comes into play.
I have taken exactly your mentioned approach, using estimated BB of yearID %in% 1961:2018, and fitted another LM model. But the outcome is 0.0x3 and p-value is 4.xxe-03, and this outcomes seems failed. I don't know
posted about 10 hours ago by john_hhu2020
I see, just knows whats going wrong after reading previous Explanation. So we are concerning about BB + HR impact on R. Not BB_per_game and HR_per_game impact on R_per_game.
Its funny my wrong approach got me success in previous questions. Just the estimated values are way smaller than the right one. Thank you
posted less than a minute ago by john_hhu2020 # ==================================================================================================================================
Level 2 headings may be created by course providers in the future. Graded assignments are locked Upgrade to gain access to locked features like this one and [][get the most out of your course.] Sorry I didn’t take this far Example Certificate When you upgrade, you:
Earn a verified certificate of completion to showcase on your resumé Unlock your access to all course activities, including graded assignments Full access to course content and materials, even after the course ends Support our mission at edX
Assessment due Jun 12, 2022 00:29 AWST
This assessment has 6 multi-part questions that will all use the setup below.
Game attendance in baseball varies partly as a function of how well a team is playing.
Load the Lahman library. The Teams data frame contains an attendance column. This is the total attendance for the season. To calculate average attendance, divide by the number of games played, as follows:
library(tidyverse) library(broom) library(Lahman) Teams_small <- Teams %>% filter(yearID %in% 1961:2001) %>% mutate(avg_attendance = attendance/G)
Use linear models to answer the following 3-part question about Teams_small.
Stratify Teams_small by wins: divide number of wins by 10 and then round to the nearest integer. Keep only strata 5 through 10, which have 20 or more data points.
Use the stratified dataset to answer this three-part question.
This meant to be empty
[][In the Confounding section, you will learn what is perhaps the most important lesson of statistics: that correlation is not causation.]
After completing this section, you will be able to:
[][* Identify examples of spurious correlation and explain how data dredging can lead to spurious correlation.*] [][* Explain how outliers can drive correlation and learn to adjust for outliers using Spearman correlation.*] [][* Explain how reversing cause and effect can lead to associations being confused with causation.*] Understand how confounders can lead to the misinterpretation of associations. Explain and give examples of Simpson’s Paradox.
This section has one part: Correlation is Not Causation. There is a comprehension checks at the end of this part, along with an assessment at the end of the section for Verified learners only.
We encourage you to use R to interactively test out your answers and further your own learning. If you get stuck, we encourage you to search the discussion boards for the answer to your issue or ask us for help!
( What is spurious correlation? https://www.google.com/search?client=firefox-b-e&q=spurious+correlation Key Takeaways. Spurious correlation, or spuriousness, occurs when two factors appear casually related to one another but are not. The appearance of a causal relationship is often due to similar movement on a chart that turns out to be coincidental or caused by a third “confounding” factor. )
( What is data dredging in research? What is data dredging (data fishing)? Data dredging – sometimes referred to as data fishing – is a data mining practice in which large data volumes are analyzed to find any possible relationships between the data.)
( Data dredging https://en.wikipedia.org/wiki/Data_dredging Data dredging (also known as data snooping or p-hacking)[1][a] is the misuse of data analysis to find patterns in data that can be presented as statistically significant, thus dramatically increasing and understating the risk of false positives. This is done by performing many statistical tests on the data and only reporting those that come back with significant results.[2] )
( What is the difference between Pearson and Spearman correlation? https://www.google.com/search?client=firefox-b-e&q=Spearman+correlation Pearson correlation: Pearson correlation evaluates the linear relationship between two continuous variables. Spearman correlation: Spearman correlation evaluates the monotonic relationship. The Spearman correlation coefficient is based on the ranked values for each variable rather than the raw data. )
( Reverse causation occurs when you believe that X causes Y, but in reality Y actually causes X. This is a common error that many people make when they look at two phenomenon and wrongly assume that one is the cause while the other is the effect. )
[][*Correlation is not causation is perhaps the most important lesson one learns in a statistics class] (How we come up into this? Spurious correlation?). ******In this course, we have described tools useful for quantifying associations between variables, but we must be careful not to over interpret these associations******. There are many reasons that a variable x can correlate with a variable y, without either being a cause for the other. Here we examine common ways that can lead to misinterpreting associations.
The first example of how we can misinterpret associations are spurious correlations. The following comical example underscores that correlation is not causation. The example shows a very strong correlation between divorce rates and margarine consumption. The correlation is 0.93. Does this mean that margarine causes divorces, or do divorces cause people to eat more margarine? Of course, the answer to both these questions is no. This is just an example of what we call spurious correlations. You can see many, many more observed examples in this website completely dedicated to spurious correlations. In fact, that’s the title of the website. The cases presented in the spurious correlation site are all examples of what is generally called [][data dredging], or data phishing, or data snooping. It’s basically a form of what in the United States, they call [][cherry picking].
[][An example of data dredging would be if you look through many results produced by a random process, and pick the one that shows a relationship that supports the theory you want to defend]. A Monte Carlo simulation can be used to show how data dredging can result in finding high correlations among variables that are theoretically uncorrelated. We’ll save the results of a simulation into a table like this. The first column denotes group and we simulated one million groups, each with 25 observations. For each group, we generate 25 observations which are stored in the second and third column. These are just random, independent normally distributed data. So we know, because we constructed the simulation, that x and y are not correlated. Next, we compute the correlation between x and y for each group, and look for the maximum. Here are the top correlations. If we just plot the data from this particular group, it shows a convincing plot that x and y are, in fact, correlated. But remember that the correlations number is a random variable. Here’s the distribution we just generated with our Monte Carlo simulation.
[][It is just a mathematical fact that if we observe random correlations that are expected to be 0, but have a standard error of about 0.2, the largest one will be close to 1 if we pick from among one million]. Note that if we performed regression on this group and interpreted the p-value, we would incorrectly claim this was a statistically significant relation. Here’s the code. Look how small the p-value is. This particular form of data dredging is referred to as p-hacking. P-hacking is a topic of much discussion because it is a problem in scientific publications. Because publishers tend to reward statistically significant results over negative results, there’s an incentive to report significant results.
In epidemiology in the social sciences for example, researchers may look for associations between an average outcome and several exposures, and report only the one exposure that resulted in a small p-value. Furthermore, they might try fitting several different models to adjust for confounding and pick the one model that yields the smallest p-value. In experimental disciplines, an experiment might be repeated more than once, and only the one that results in a small p-value are reported. This does not necessarily happen due to unethical behavior, but rather to statistical ignorance or wishful thinking. In advanced statistics courses, you’ll learn methods to adjust for what is called the [][multiple comparison problem].
[][Textbook link]
This video corresponds to the textbook section on spurious correlation External link. https://rafalab.github.io/dsbook/association-is-not-causation.html#spurious-correlation
[][Key points]
Association/correlation is not causation.
[][* p-hacking is a topic of much discussion because it is a problem in scientific publications. Because publishers tend to reward statistically significant results over negative results, there is an incentive to report significant results.*]
Code
N <- 25 g <- 1000000 sim_data <- tibble(group = rep(1:g, each = N), x = rnorm(N * g), y = rnorm(N * g))
res <- sim_data %>% group_by(group) %>% summarize(r = cor(x, y)) %>% arrange(desc(r)) res
sim_data %>% filter(group == res\(group[which.max(res\)r)]) %>% ggplot(aes(x, y)) + geom_point() + geom_smooth(method = “lm”)
res %>% ggplot(aes(x=r)) + geom_histogram(binwidth = 0.1, color = “black”)
library(broom) sim_data %>% filter(group == res\(group[which.max(res\)r)]) %>% do(tidy(lm(y ~ x, data = .)))
N <- 25
G <- 1000000
sim_data <- tibble(group = rep(1:G, each = N), X = rnorm(N*G), Y = rnorm(N*G))
sum(sim_data$Y[sim_data$Y>=0.05])/sum(sim_data$Y)
## [1] 4807
#N <- 25
#N <- 50
N <- 100
G <- 1000000
sim_data <- tibble(group = rep(1:G, each = N), X = rnorm(N*G), Y = rnorm(N*G))
sim_data
## # A tibble: 100,000,000 × 3
## group X Y
## <int> <dbl> <dbl>
## 1 1 0.788 1.54
## 2 1 -1.51 -0.00109
## 3 1 -0.999 0.382
## 4 1 -0.390 0.392
## 5 1 1.07 1.01
## 6 1 -2.65 0.284
## 7 1 1.84 0.331
## 8 1 -0.887 0.519
## 9 1 0.264 0.261
## 10 1 0.0309 -1.38
## # … with 99,999,990 more rows
res <- sim_data %>%
group_by(group) %>%
summarise(r = cor(X, Y)) %>%
arrange(desc(r))
res
## # A tibble: 1,000,000 × 2
## group r
## <int> <dbl>
## 1 253402 0.468
## 2 643179 0.453
## 3 618168 0.452
## 4 758422 0.448
## 5 17643 0.447
## 6 858354 0.436
## 7 561473 0.435
## 8 193050 0.423
## 9 817115 0.422
## 10 577035 0.422
## # … with 999,990 more rows
sim_data %>%
filter(group == res$group[which.max(res$r)]) %>%
ggplot(aes(X, Y)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x)
res %>%
ggplot(aes(x = r)) +
geom_histogram(bins = 20, color = "black")
sim_data %>%
filter(group == res$group[which.max(res$r)]) %>%
do(tidy(lm(Y ~ X, data = .)))
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.0419 0.0976 -0.429 0.669
## 2 X 0.455 0.0868 5.24 0.000000938
now do you remember how we use “lm” to specify this line in ggplot ???
these are just random, independent normally distributed data
look how small the p value is here, though only one is small
[][Another way that we can see high correlations when there’s no causation is when we have outliers]. Suppose we take measurements from two independent outcomes, x and y, and we standardize the measurements. However, imagine we made a mistake and forgot to standardize entry 23. We can simulate such data using the following code. The data looks like this. Not surprisingly, the correlation is very high. That one point, that one outlier, is making the correlation be as high as 0.99. But again, this is driven by that one outlier.
If we remove this outlier, the correlation is greatly reduced to almost 0, which is what it should be. Here’s what we get if we remove entry 23. So one way to deal with outliers is to try to detect them and remove them. But there is an alternative way to the sample correlation for estimating the population correlation that is robust to outliers. It is called [][Spearman correlation]. The idea is simple. Compute the correlation on the ranks of the values, rather than the values themselves. Here’s a plot of the ranks plotted against each other for that data set that includes the outlier. Note that the one point that’s very large is just at the 100-100 location. It is no longer really out there and pulling the correlation towards 1. So if we compute the correlation of the ranks, we get something much closer to 0, as we see here. Spearman correlation can also be calculated with the correlation function, but using the method argument to tell cor which correlation to compute. There are also methods for robust fitting of linear models, which you can learn about in, for example, this book.
[][Textbook link]
This video corresponds to the textbook section on outliers. https://rafalab.github.io/dsbook/association-is-not-causation.html#outliers-1
[][Key points]
Correlations can be caused by outliers.
The Spearman correlation is calculated based on the ranks of data.
Code
set.seed(1985) x <- rnorm(100,100,1) y <- rnorm(100,84,1) x[-23] <- scale(x[-23]) y[-23] <- scale(y[-23])
qplot(x, y, alpha = 0.5)
cor(x,y) cor(x[-23], y[-23])
qplot(rank(x), rank(y)) cor(rank(x), rank(y))
cor(x, y, method = “spearman”)
Another way associations are confounded with causation is when the cause and effect are reversed. An example of this is claiming that tutoring makes students perform worse because they test lower than peers that are not tutored. Here, the tutoring is not causing the low test, but the other way around. A form of this claim was actually made in an op ed in the New York Times, titled “Parental Involvement is Overrated”. Consider this quote from the article. “When we examine whether regular help with homework had a positive impact on children’s academic performance, we were quite startled by what we found. Regardless of family social class, racial, or ethnic background, or child’s grade level, consistent homework help almost never improved test scores or grades. Even more surprising to us was that when parents regularly helped with homework, kids usually performed worse.” A very likely possibility is that children needing regular parental help get this help because they don’t perform well in school.
To see another example, we’re going to use one of the data sets that we’ve seen in this course. Specifically, we can easily construct an example of cause and effect reversal using the father and son height data. Note that if we fit the following model to the father and son height data, with x representing the father height, and y representing the son height, we do get a statistically significant result. You can see that with this simple code. This model fits the data very well. However, if we look at the mathematical formulation of the model, it could easily be incorrectly interpreted as to suggest that the son being tall caused the father to be tall. But given what we know about genetics and biology, we know it’s the other way around. The model is technically correct. The estimates and p-value were obtained correctly as well. What is wrong here is simply the interpretation.
[][Textbook link]
This video corresponds to the textbook section on reversing cause and effect. https://rafalab.github.io/dsbook/association-is-not-causation.html#reversing-cause-and-effect
[][Key points]
Another way association can be confused with causation is when the cause and effect are reversed.
As discussed in the video, in the Galton data, when father and son were reversed in the regression, the model was technically correct. The estimates and p-values were obtained correctly as well. What was incorrect was the interpretation of the model.
Code
library(HistData) data(“GaltonFamilies”) GaltonFamilies %>% filter(childNum == 1 & gender == “male”) %>% select(father, childHeight) %>% rename(son = childHeight) %>% do(tidy(lm(father ~ son, data = .)))
library(tidyverse)
library(broom)
library(HistData)
data("GaltonFamilies")
GaltonFamilies
## family father mother midparentHeight children childNum gender childHeight
## 1 001 78.5 67.0 75.4 4 1 male 73.2
## 2 001 78.5 67.0 75.4 4 2 female 69.2
## 3 001 78.5 67.0 75.4 4 3 female 69.0
## 4 001 78.5 67.0 75.4 4 4 female 69.0
## 5 002 75.5 66.5 73.7 4 1 male 73.5
## 6 002 75.5 66.5 73.7 4 2 male 72.5
## 7 002 75.5 66.5 73.7 4 3 female 65.5
## 8 002 75.5 66.5 73.7 4 4 female 65.5
## 9 003 75.0 64.0 72.1 2 1 male 71.0
## 10 003 75.0 64.0 72.1 2 2 female 68.0
## 11 004 75.0 64.0 72.1 5 1 male 70.5
## 12 004 75.0 64.0 72.1 5 2 male 68.5
## 13 004 75.0 64.0 72.1 5 3 female 67.0
## 14 004 75.0 64.0 72.1 5 4 female 64.5
## 15 004 75.0 64.0 72.1 5 5 female 63.0
## 16 005 75.0 58.5 69.1 6 1 male 72.0
## 17 005 75.0 58.5 69.1 6 2 male 69.0
## 18 005 75.0 58.5 69.1 6 3 male 68.0
## 19 005 75.0 58.5 69.1 6 4 female 66.5
## 20 005 75.0 58.5 69.1 6 5 female 62.5
## 21 005 75.0 58.5 69.1 6 6 female 62.5
## 22 006 74.0 68.0 73.7 1 1 female 69.5
## 23 007 74.0 68.0 73.7 6 1 male 76.5
## 24 007 74.0 68.0 73.7 6 2 male 74.0
## 25 007 74.0 68.0 73.7 6 3 male 73.0
## 26 007 74.0 68.0 73.7 6 4 male 73.0
## 27 007 74.0 68.0 73.7 6 5 female 70.5
## 28 007 74.0 68.0 73.7 6 6 female 64.0
## 29 008 74.0 66.5 72.9 3 1 female 70.5
## 30 008 74.0 66.5 72.9 3 2 female 68.0
## 31 008 74.0 66.5 72.9 3 3 female 66.0
## 32 009 74.5 66.0 72.9 1 1 female 66.0
## 33 010 74.0 65.5 72.4 1 1 female 65.5
## 34 011 74.0 62.0 70.5 8 1 male 74.0
## 35 011 74.0 62.0 70.5 8 2 male 70.0
## 36 011 74.0 62.0 70.5 8 3 female 68.0
## 37 011 74.0 62.0 70.5 8 4 female 67.0
## 38 011 74.0 62.0 70.5 8 5 female 67.0
## 39 011 74.0 62.0 70.5 8 6 female 66.0
## 40 011 74.0 62.0 70.5 8 7 female 63.5
## 41 011 74.0 62.0 70.5 8 8 female 63.0
## 42 012 74.0 61.0 69.9 1 1 female 65.0
## 43 013 73.0 67.0 72.7 2 1 male 71.0
## 44 013 73.0 67.0 72.7 2 2 female 62.0
## 45 014 73.0 67.0 72.7 2 1 male 68.0
## 46 014 73.0 67.0 72.7 2 2 male 67.0
## 47 015 73.0 66.5 72.4 3 1 male 71.0
## 48 015 73.0 66.5 72.4 3 2 male 70.5
## 49 015 73.0 66.5 72.4 3 3 female 66.7
## 50 016 73.0 65.0 71.6 9 1 male 72.0
## 51 016 73.0 65.0 71.6 9 2 male 70.5
## 52 016 73.0 65.0 71.6 9 3 male 70.2
## 53 016 73.0 65.0 71.6 9 4 male 70.2
## 54 016 73.0 65.0 71.6 9 5 male 69.2
## 55 016 73.0 65.0 71.6 9 6 female 68.7
## 56 016 73.0 65.0 71.6 9 7 female 66.5
## 57 016 73.0 65.0 71.6 9 8 female 64.5
## 58 016 73.0 65.0 71.6 9 9 female 63.5
## 59 017 73.0 64.5 71.3 6 1 male 74.0
## 60 017 73.0 64.5 71.3 6 2 male 73.0
## 61 017 73.0 64.5 71.3 6 3 male 71.5
## 62 017 73.0 64.5 71.3 6 4 male 62.5
## 63 017 73.0 64.5 71.3 6 5 female 66.5
## 64 017 73.0 64.5 71.3 6 6 female 62.3
## 65 018 73.0 64.0 71.1 3 1 female 66.0
## 66 018 73.0 64.0 71.1 3 2 female 64.5
## 67 018 73.0 64.0 71.1 3 3 female 64.0
## 68 019 73.2 63.0 70.6 1 1 female 62.7
## 69 020 72.7 69.0 73.6 8 1 male 73.2
## 70 020 72.7 69.0 73.6 8 2 male 73.0
## 71 020 72.7 69.0 73.6 8 3 male 72.7
## 72 020 72.7 69.0 73.6 8 4 female 70.0
## 73 020 72.7 69.0 73.6 8 5 female 69.0
## 74 020 72.7 69.0 73.6 8 6 female 68.5
## 75 020 72.7 69.0 73.6 8 7 female 68.0
## 76 020 72.7 69.0 73.6 8 8 female 66.0
## 77 021 72.0 68.0 72.7 3 1 male 73.0
## 78 021 72.0 68.0 72.7 3 2 female 68.5
## 79 021 72.0 68.0 72.7 3 3 female 68.0
## 80 022 72.0 67.0 72.2 3 1 male 73.0
## 81 022 72.0 67.0 72.2 3 2 male 71.0
## 82 022 72.0 67.0 72.2 3 3 female 67.0
## 83 023 72.0 65.0 71.1 7 1 male 74.2
## 84 023 72.0 65.0 71.1 7 2 male 70.5
## 85 023 72.0 65.0 71.1 7 3 male 69.5
## 86 023 72.0 65.0 71.1 7 4 female 66.0
## 87 023 72.0 65.0 71.1 7 5 female 65.5
## 88 023 72.0 65.0 71.1 7 6 female 65.0
## 89 023 72.0 65.0 71.1 7 7 female 65.0
## 90 024 72.0 65.5 71.4 1 1 female 65.5
## 91 025 72.0 64.0 70.6 2 1 female 66.0
## 92 025 72.0 64.0 70.6 2 2 female 63.0
## 93 026 72.0 63.0 70.0 5 1 male 70.5
## 94 026 72.0 63.0 70.0 5 2 male 70.5
## 95 026 72.0 63.0 70.0 5 3 male 69.0
## 96 026 72.0 63.0 70.0 5 4 female 65.0
## 97 026 72.0 63.0 70.0 5 5 female 63.0
## 98 027 72.0 63.0 70.0 3 1 male 69.0
## 99 027 72.0 63.0 70.0 3 2 male 67.0
## 100 027 72.0 63.0 70.0 3 3 female 63.0
## 101 028 72.0 63.0 70.0 6 1 male 73.0
## 102 028 72.0 63.0 70.0 6 2 male 67.0
## 103 028 72.0 63.0 70.0 6 3 female 70.5
## 104 028 72.0 63.0 70.0 6 4 female 70.0
## 105 028 72.0 63.0 70.0 6 5 female 66.5
## 106 028 72.0 63.0 70.0 6 6 female 63.0
## 107 029 72.5 63.5 70.5 3 1 female 67.5
## 108 029 72.5 63.5 70.5 3 2 female 67.2
## 109 029 72.5 63.5 70.5 3 3 female 66.7
## 110 030 72.0 62.0 69.5 1 1 female 64.0
## 111 031 72.5 62.0 69.7 6 1 male 71.0
## 112 031 72.5 62.0 69.7 6 2 male 70.0
## 113 031 72.5 62.0 69.7 6 3 male 70.0
## 114 031 72.5 62.0 69.7 6 4 female 66.0
## 115 031 72.5 62.0 69.7 6 5 female 65.0
## 116 031 72.5 62.0 69.7 6 6 female 65.0
## 117 032 72.0 62.0 69.5 5 1 male 74.0
## 118 032 72.0 62.0 69.5 5 2 male 72.0
## 119 032 72.0 62.0 69.5 5 3 male 69.0
## 120 032 72.0 62.0 69.5 5 4 female 67.5
## 121 032 72.0 62.0 69.5 5 5 female 63.5
## 122 033 72.0 62.0 69.5 5 1 male 72.0
## 123 033 72.0 62.0 69.5 5 2 male 71.5
## 124 033 72.0 62.0 69.5 5 3 male 71.5
## 125 033 72.0 62.0 69.5 5 4 male 70.0
## 126 033 72.0 62.0 69.5 5 5 female 68.0
## 127 034 72.0 61.0 68.9 1 1 female 65.7
## 128 035 71.0 69.0 72.8 5 1 male 78.0
## 129 035 71.0 69.0 72.8 5 2 male 74.0
## 130 035 71.0 69.0 72.8 5 3 male 73.0
## 131 035 71.0 69.0 72.8 5 4 male 72.0
## 132 035 71.0 69.0 72.8 5 5 female 67.0
## 133 036 71.0 67.0 71.7 4 1 male 73.2
## 134 036 71.0 67.0 71.7 4 2 male 73.0
## 135 036 71.0 67.0 71.7 4 3 male 69.0
## 136 036 71.0 67.0 71.7 4 4 female 67.0
## 137 037 71.0 66.0 71.1 4 1 male 70.0
## 138 037 71.0 66.0 71.1 4 2 female 67.0
## 139 037 71.0 66.0 71.1 4 3 female 67.0
## 140 037 71.0 66.0 71.1 4 4 female 66.5
## 141 038 71.0 66.0 71.1 6 1 male 70.0
## 142 038 71.0 66.0 71.1 6 2 male 69.0
## 143 038 71.0 66.0 71.1 6 3 male 68.5
## 144 038 71.0 66.0 71.1 6 4 female 66.0
## 145 038 71.0 66.0 71.1 6 5 female 64.5
## 146 038 71.0 66.0 71.1 6 6 female 63.0
## 147 039 71.0 66.0 71.1 2 1 male 71.0
## 148 039 71.0 66.0 71.1 2 2 female 67.0
## 149 040 71.0 66.0 71.1 5 1 male 76.0
## 150 040 71.0 66.0 71.1 5 2 male 72.0
## 151 040 71.0 66.0 71.1 5 3 male 71.0
## 152 040 71.0 66.0 71.1 5 4 male 66.0
## 153 040 71.0 66.0 71.1 5 5 female 66.0
## 154 041 71.7 65.5 71.2 1 1 male 70.5
## 155 042 71.0 65.5 70.9 6 1 male 72.0
## 156 042 71.0 65.5 70.9 6 2 male 72.0
## 157 042 71.0 65.5 70.9 6 3 male 71.0
## 158 042 71.0 65.5 70.9 6 4 male 69.0
## 159 042 71.0 65.5 70.9 6 5 female 66.0
## 160 042 71.0 65.5 70.9 6 6 female 65.0
## 161 043 71.5 65.5 71.1 2 1 male 73.0
## 162 043 71.5 65.5 71.1 2 2 female 65.2
## 163 044 71.5 65.0 70.8 2 1 male 68.5
## 164 044 71.5 65.0 70.8 2 2 male 67.7
## 165 045 71.0 65.0 70.6 3 1 male 68.0
## 166 045 71.0 65.0 70.6 3 2 male 68.0
## 167 045 71.0 65.0 70.6 3 3 female 62.0
## 168 046 71.0 64.0 70.1 8 1 female 68.0
## 169 046 71.0 64.0 70.1 8 2 female 68.0
## 170 046 71.0 64.0 70.1 8 3 female 67.5
## 171 046 71.0 64.0 70.1 8 4 female 66.5
## 172 046 71.0 64.0 70.1 8 5 female 66.5
## 173 046 71.0 64.0 70.1 8 6 female 66.0
## 174 046 71.0 64.0 70.1 8 7 female 65.5
## 175 046 71.0 64.0 70.1 8 8 female 65.0
## 176 047 71.7 64.5 70.7 4 1 male 72.0
## 177 047 71.7 64.5 70.7 4 2 male 71.0
## 178 047 71.7 64.5 70.7 4 3 male 70.5
## 179 047 71.7 64.5 70.7 4 4 female 67.0
## 180 048 71.0 64.0 70.1 3 1 male 68.0
## 181 048 71.0 64.0 70.1 3 2 male 68.0
## 182 048 71.0 64.0 70.1 3 3 male 68.0
## 183 049 71.5 64.5 70.6 7 1 male 72.0
## 184 049 71.5 64.5 70.6 7 2 male 71.0
## 185 049 71.5 64.5 70.6 7 3 male 70.0
## 186 049 71.5 64.5 70.6 7 4 female 66.0
## 187 049 71.5 64.5 70.6 7 5 female 64.5
## 188 049 71.5 64.5 70.6 7 6 female 64.5
## 189 049 71.5 64.5 70.6 7 7 female 62.0
## 190 050 71.0 64.5 70.3 2 1 male 73.0
## 191 050 71.0 64.5 70.3 2 2 female 62.0
## 192 051 71.2 63.0 69.6 2 1 female 67.5
## 193 051 71.2 63.0 69.6 2 2 female 64.5
## 194 052 71.0 63.5 69.8 5 1 male 71.0
## 195 052 71.0 63.5 69.8 5 2 male 67.0
## 196 052 71.0 63.5 69.8 5 3 female 66.0
## 197 052 71.0 63.5 69.8 5 4 female 65.0
## 198 052 71.0 63.5 69.8 5 5 female 63.5
## 199 053 71.0 63.0 69.5 9 1 male 71.0
## 200 053 71.0 63.0 69.5 9 2 male 70.0
## 201 053 71.0 63.0 69.5 9 3 male 70.0
## 202 053 71.0 63.0 69.5 9 4 male 64.0
## 203 053 71.0 63.0 69.5 9 5 female 65.0
## 204 053 71.0 63.0 69.5 9 6 female 65.0
## 205 053 71.0 63.0 69.5 9 7 female 64.0
## 206 053 71.0 63.0 69.5 9 8 female 63.0
## 207 053 71.0 63.0 69.5 9 9 female 63.0
## 208 054 71.0 63.0 69.5 4 1 male 71.0
## 209 054 71.0 63.0 69.5 4 2 male 71.0
## 210 054 71.0 63.0 69.5 4 3 male 70.0
## 211 054 71.0 63.0 69.5 4 4 female 63.5
## 212 055 71.0 62.0 69.0 5 1 male 71.0
## 213 055 71.0 62.0 69.0 5 2 male 70.0
## 214 055 71.0 62.0 69.0 5 3 female 64.5
## 215 055 71.0 62.0 69.0 5 4 female 62.5
## 216 055 71.0 62.0 69.0 5 5 female 61.5
## 217 056 71.0 62.0 69.0 5 1 male 72.0
## 218 056 71.0 62.0 69.0 5 2 male 70.5
## 219 056 71.0 62.0 69.0 5 3 male 70.5
## 220 056 71.0 62.0 69.0 5 4 female 64.5
## 221 056 71.0 62.0 69.0 5 5 female 60.0
## 222 057 71.0 62.5 69.2 5 1 male 70.0
## 223 057 71.0 62.5 69.2 5 2 female 64.0
## 224 057 71.0 62.5 69.2 5 3 female 64.0
## 225 057 71.0 62.5 69.2 5 4 female 64.0
## 226 057 71.0 62.5 69.2 5 5 female 62.5
## 227 058 71.0 62.0 69.0 7 1 male 70.5
## 228 058 71.0 62.0 69.0 7 2 male 70.0
## 229 058 71.0 62.0 69.0 7 3 male 69.0
## 230 058 71.0 62.0 69.0 7 4 male 69.0
## 231 058 71.0 62.0 69.0 7 5 male 66.0
## 232 058 71.0 62.0 69.0 7 6 female 64.5
## 233 058 71.0 62.0 69.0 7 7 female 64.0
## 234 059 71.0 61.0 68.4 1 1 female 62.0
## 235 060 71.0 58.0 66.8 2 1 male 71.5
## 236 060 71.0 58.0 66.8 2 2 male 69.0
## 237 061 70.0 69.0 72.3 4 1 male 71.0
## 238 061 70.0 69.0 72.3 4 2 male 70.0
## 239 061 70.0 69.0 72.3 4 3 male 69.0
## 240 061 70.0 69.0 72.3 4 4 female 69.0
## 241 062 70.0 69.0 72.3 6 1 male 70.0
## 242 062 70.0 69.0 72.3 6 2 male 68.7
## 243 062 70.0 69.0 72.3 6 3 female 68.0
## 244 062 70.0 69.0 72.3 6 4 female 66.0
## 245 062 70.0 69.0 72.3 6 5 female 64.0
## 246 062 70.0 69.0 72.3 6 6 female 62.0
## 247 063 70.0 68.0 71.7 1 1 male 75.0
## 248 064 70.0 67.0 71.2 5 1 male 70.0
## 249 064 70.0 67.0 71.2 5 2 male 69.0
## 250 064 70.0 67.0 71.2 5 3 female 66.0
## 251 064 70.0 67.0 71.2 5 4 female 64.0
## 252 064 70.0 67.0 71.2 5 5 female 60.0
## 253 065 70.0 67.0 71.2 1 1 female 67.5
## 254 066 70.0 66.5 70.9 11 1 male 73.0
## 255 066 70.0 66.5 70.9 11 2 male 72.0
## 256 066 70.0 66.5 70.9 11 3 male 72.0
## 257 066 70.0 66.5 70.9 11 4 male 66.5
## 258 066 70.0 66.5 70.9 11 5 female 69.2
## 259 066 70.0 66.5 70.9 11 6 female 67.2
## 260 066 70.0 66.5 70.9 11 7 female 66.5
## 261 066 70.0 66.5 70.9 11 8 female 66.0
## 262 066 70.0 66.5 70.9 11 9 female 66.0
## 263 066 70.0 66.5 70.9 11 10 female 64.2
## 264 066 70.0 66.5 70.9 11 11 female 63.7
## 265 067 70.5 65.0 70.3 4 1 male 72.0
## 266 067 70.5 65.0 70.3 4 2 male 70.2
## 267 067 70.5 65.0 70.3 4 3 male 69.0
## 268 067 70.5 65.0 70.3 4 4 male 68.5
## 269 068 70.5 65.0 70.3 5 1 female 68.0
## 270 068 70.5 65.0 70.3 5 2 female 65.0
## 271 068 70.5 65.0 70.3 5 3 female 61.5
## 272 068 70.5 65.0 70.3 5 4 female 61.0
## 273 068 70.5 65.0 70.3 5 5 female 61.0
## 274 069 70.0 65.0 70.1 8 1 male 73.0
## 275 069 70.0 65.0 70.1 8 2 male 72.0
## 276 069 70.0 65.0 70.1 8 3 male 70.5
## 277 069 70.0 65.0 70.1 8 4 male 65.0
## 278 069 70.0 65.0 70.1 8 5 male 65.0
## 279 069 70.0 65.0 70.1 8 6 female 64.5
## 280 069 70.0 65.0 70.1 8 7 female 63.0
## 281 069 70.0 65.0 70.1 8 8 female 62.0
## 282 070 70.0 65.0 70.1 5 1 male 67.0
## 283 070 70.0 65.0 70.1 5 2 male 65.0
## 284 070 70.0 65.0 70.1 5 3 female 64.5
## 285 070 70.0 65.0 70.1 5 4 female 62.5
## 286 070 70.0 65.0 70.1 5 5 female 62.5
## 287 071 70.0 65.0 70.1 6 1 male 70.0
## 288 071 70.0 65.0 70.1 6 2 male 70.0
## 289 071 70.0 65.0 70.1 6 3 female 67.0
## 290 071 70.0 65.0 70.1 6 4 female 65.0
## 291 071 70.0 65.0 70.1 6 5 female 65.0
## 292 071 70.0 65.0 70.1 6 6 female 63.0
## 293 072 70.0 65.0 70.1 7 1 male 79.0
## 294 072 70.0 65.0 70.1 7 2 male 75.0
## 295 072 70.0 65.0 70.1 7 3 male 71.0
## 296 072 70.0 65.0 70.1 7 4 female 69.0
## 297 072 70.0 65.0 70.1 7 5 female 67.0
## 298 072 70.0 65.0 70.1 7 6 female 65.7
## 299 072 70.0 65.0 70.1 7 7 female 62.0
## 300 073 70.0 65.0 70.1 3 1 male 73.0
## 301 073 70.0 65.0 70.1 3 2 male 72.5
## 302 073 70.0 65.0 70.1 3 3 female 65.0
## 303 074 70.0 65.0 70.1 2 1 male 69.0
## 304 074 70.0 65.0 70.1 2 2 male 69.0
## 305 075 70.0 64.7 69.9 7 1 male 72.0
## 306 075 70.0 64.7 69.9 7 2 male 70.0
## 307 075 70.0 64.7 69.9 7 3 male 68.7
## 308 075 70.0 64.7 69.9 7 4 female 66.5
## 309 075 70.0 64.7 69.9 7 5 female 65.5
## 310 075 70.0 64.7 69.9 7 6 female 64.7
## 311 075 70.0 64.7 69.9 7 7 female 64.5
## 312 076 70.0 64.0 69.6 7 1 male 70.7
## 313 076 70.0 64.0 69.6 7 2 male 70.0
## 314 076 70.0 64.0 69.6 7 3 male 68.0
## 315 076 70.0 64.0 69.6 7 4 male 67.0
## 316 076 70.0 64.0 69.6 7 5 male 66.0
## 317 076 70.0 64.0 69.6 7 6 male 65.0
## 318 076 70.0 64.0 69.6 7 7 female 67.0
## 319 077 70.0 64.0 69.6 4 1 male 70.0
## 320 077 70.0 64.0 69.6 4 2 male 68.0
## 321 077 70.0 64.0 69.6 4 3 male 66.7
## 322 077 70.0 64.0 69.6 4 4 female 65.5
## 323 078 70.0 64.2 69.7 5 1 male 72.0
## 324 078 70.0 64.2 69.7 5 2 male 70.0
## 325 078 70.0 64.2 69.7 5 3 female 62.5
## 326 078 70.0 64.2 69.7 5 4 female 61.2
## 327 078 70.0 64.2 69.7 5 5 female 60.1
## 328 079 70.5 64.0 69.8 8 1 male 74.0
## 329 079 70.5 64.0 69.8 8 2 male 69.5
## 330 079 70.5 64.0 69.8 8 3 male 69.0
## 331 079 70.5 64.0 69.8 8 4 male 68.0
## 332 079 70.5 64.0 69.8 8 5 male 68.0
## 333 079 70.5 64.0 69.8 8 6 male 68.0
## 334 079 70.5 64.0 69.8 8 7 female 65.5
## 335 079 70.5 64.0 69.8 8 8 female 65.0
## 336 080 70.5 64.5 70.1 1 1 female 60.0
## 337 081 70.0 64.0 69.6 4 1 male 68.0
## 338 081 70.0 64.0 69.6 4 2 female 65.0
## 339 081 70.0 64.0 69.6 4 3 female 64.0
## 340 081 70.0 64.0 69.6 4 4 female 62.0
## 341 082 70.0 64.0 69.6 9 1 male 71.0
## 342 082 70.0 64.0 69.6 9 2 male 70.0
## 343 082 70.0 64.0 69.6 9 3 male 70.0
## 344 082 70.0 64.0 69.6 9 4 male 70.0
## 345 082 70.0 64.0 69.6 9 5 male 69.5
## 346 082 70.0 64.0 69.6 9 6 male 68.5
## 347 082 70.0 64.0 69.6 9 7 female 69.0
## 348 082 70.0 64.0 69.6 9 8 female 65.0
## 349 082 70.0 64.0 69.6 9 9 female 64.0
## 350 083 70.0 63.7 69.4 8 1 male 70.0
## 351 083 70.0 63.7 69.4 8 2 male 67.0
## 352 083 70.0 63.7 69.4 8 3 male 65.5
## 353 083 70.0 63.7 69.4 8 4 female 63.7
## 354 083 70.0 63.7 69.4 8 5 female 63.2
## 355 083 70.0 63.7 69.4 8 6 female 62.5
## 356 083 70.0 63.7 69.4 8 7 female 62.2
## 357 083 70.0 63.7 69.4 8 8 female 61.0
## 358 084 70.5 63.0 69.3 4 1 male 70.0
## 359 084 70.5 63.0 69.3 4 2 male 68.5
## 360 084 70.5 63.0 69.3 4 3 female 65.5
## 361 084 70.5 63.0 69.3 4 4 female 63.5
## 362 085 70.5 63.0 69.3 5 1 male 72.5
## 363 085 70.5 63.0 69.3 5 2 male 69.0
## 364 085 70.5 63.0 69.3 5 3 male 67.0
## 365 085 70.5 63.0 69.3 5 4 female 64.5
## 366 085 70.5 63.0 69.3 5 5 female 64.0
## 367 086 70.0 63.5 69.3 4 1 male 71.0
## 368 086 70.0 63.5 69.3 4 2 male 67.5
## 369 086 70.0 63.5 69.3 4 3 female 67.5
## 370 086 70.0 63.5 69.3 4 4 female 63.5
## 371 087 70.0 63.0 69.0 4 1 male 68.0
## 372 087 70.0 63.0 69.0 4 2 male 67.0
## 373 087 70.0 63.0 69.0 4 3 female 63.7
## 374 087 70.0 63.0 69.0 4 4 female 62.0
## 375 088 70.0 63.0 69.0 4 1 male 70.0
## 376 088 70.0 63.0 69.0 4 2 male 66.5
## 377 088 70.0 63.0 69.0 4 3 female 62.0
## 378 088 70.0 63.0 69.0 4 4 female 61.0
## 379 089 70.5 62.0 68.7 8 1 male 72.0
## 380 089 70.5 62.0 68.7 8 2 male 70.0
## 381 089 70.5 62.0 68.7 8 3 male 69.5
## 382 089 70.5 62.0 68.7 8 4 male 69.5
## 383 089 70.5 62.0 68.7 8 5 male 68.0
## 384 089 70.5 62.0 68.7 8 6 female 65.0
## 385 089 70.5 62.0 68.7 8 7 female 64.0
## 386 089 70.5 62.0 68.7 8 8 female 63.0
## 387 090 70.3 62.7 69.0 7 1 male 70.7
## 388 090 70.3 62.7 69.0 7 2 male 69.7
## 389 090 70.3 62.7 69.0 7 3 male 69.2
## 390 090 70.3 62.7 69.0 7 4 male 65.2
## 391 090 70.3 62.7 69.0 7 5 female 64.0
## 392 090 70.3 62.7 69.0 7 6 female 63.5
## 393 090 70.3 62.7 69.0 7 7 female 63.2
## 394 091 70.5 62.0 68.7 3 1 male 72.0
## 395 091 70.5 62.0 68.7 3 2 male 72.0
## 396 091 70.5 62.0 68.7 3 3 female 60.0
## 397 092 70.0 61.0 67.9 2 1 male 71.2
## 398 092 70.0 61.0 67.9 2 2 male 67.0
## 399 093 70.0 60.0 67.4 4 1 male 67.0
## 400 093 70.0 60.0 67.4 4 2 male 64.5
## 401 093 70.0 60.0 67.4 4 3 female 65.0
## 402 093 70.0 60.0 67.4 4 4 female 63.0
## 403 094 70.0 60.0 67.4 2 1 female 65.0
## 404 094 70.0 60.0 67.4 2 2 female 65.0
## 405 095 70.0 58.5 66.6 3 1 male 71.5
## 406 095 70.0 58.5 66.6 3 2 male 64.5
## 407 095 70.0 58.5 66.6 3 3 female 63.0
## 408 096 70.0 58.0 66.3 5 1 male 72.0
## 409 096 70.0 58.0 66.3 5 2 male 66.0
## 410 096 70.0 58.0 66.3 5 3 female 66.0
## 411 096 70.0 58.0 66.3 5 4 female 65.0
## 412 096 70.0 58.0 66.3 5 5 female 63.0
## 413 097 69.0 68.5 71.5 10 1 male 75.0
## 414 097 69.0 68.5 71.5 10 2 male 71.0
## 415 097 69.0 68.5 71.5 10 3 male 70.0
## 416 097 69.0 68.5 71.5 10 4 female 66.0
## 417 097 69.0 68.5 71.5 10 5 female 66.0
## 418 097 69.0 68.5 71.5 10 6 female 65.5
## 419 097 69.0 68.5 71.5 10 7 female 65.0
## 420 097 69.0 68.5 71.5 10 8 female 65.0
## 421 097 69.0 68.5 71.5 10 9 female 64.0
## 422 097 69.0 68.5 71.5 10 10 female 64.0
## 423 098 69.0 67.0 70.7 1 1 female 64.0
## 424 099 69.0 66.0 70.1 8 1 male 73.0
## 425 099 69.0 66.0 70.1 8 2 male 72.0
## 426 099 69.0 66.0 70.1 8 3 male 71.7
## 427 099 69.0 66.0 70.1 8 4 male 71.5
## 428 099 69.0 66.0 70.1 8 5 female 65.5
## 429 099 69.0 66.0 70.1 8 6 female 65.0
## 430 099 69.0 66.0 70.1 8 7 female 62.7
## 431 099 69.0 66.0 70.1 8 8 female 62.5
## 432 100 69.0 66.0 70.1 3 1 male 71.2
## 433 100 69.0 66.0 70.1 3 2 male 71.0
## 434 100 69.0 66.0 70.1 3 3 male 70.0
## 435 101 69.0 66.7 70.5 4 1 male 75.0
## 436 101 69.0 66.7 70.5 4 2 male 74.0
## 437 101 69.0 66.7 70.5 4 3 male 72.0
## 438 101 69.0 66.7 70.5 4 4 male 68.5
## 439 102 69.0 66.0 70.1 6 1 male 70.0
## 440 102 69.0 66.0 70.1 6 2 male 68.5
## 441 102 69.0 66.0 70.1 6 3 male 68.0
## 442 102 69.0 66.0 70.1 6 4 female 65.0
## 443 102 69.0 66.0 70.1 6 5 female 63.0
## 444 102 69.0 66.0 70.1 6 6 female 62.5
## 445 103 69.0 66.5 70.4 7 1 male 73.0
## 446 103 69.0 66.5 70.4 7 2 male 71.0
## 447 103 69.0 66.5 70.4 7 3 male 70.5
## 448 103 69.0 66.5 70.4 7 4 male 70.5
## 449 103 69.0 66.5 70.4 7 5 male 67.0
## 450 103 69.0 66.5 70.4 7 6 male 66.0
## 451 103 69.0 66.5 70.4 7 7 female 61.0
## 452 104 69.5 66.5 70.7 4 1 male 70.5
## 453 104 69.5 66.5 70.7 4 2 male 67.5
## 454 104 69.5 66.5 70.7 4 3 female 64.5
## 455 104 69.5 66.5 70.7 4 4 female 64.0
## 456 105 69.0 66.5 70.4 6 1 male 71.0
## 457 105 69.0 66.5 70.4 6 2 female 68.5
## 458 105 69.0 66.5 70.4 6 3 female 67.5
## 459 105 69.0 66.5 70.4 6 4 female 66.0
## 460 105 69.0 66.5 70.4 6 5 female 63.0
## 461 105 69.0 66.5 70.4 6 6 female 63.0
## 462 106 69.5 66.0 70.4 7 1 male 71.0
## 463 106 69.5 66.0 70.4 7 2 male 71.0
## 464 106 69.5 66.0 70.4 7 3 male 70.5
## 465 106 69.5 66.0 70.4 7 4 male 70.5
## 466 106 69.5 66.0 70.4 7 5 female 66.5
## 467 106 69.5 66.0 70.4 7 6 female 65.5
## 468 106 69.5 66.0 70.4 7 7 female 64.5
## 469 107 69.0 66.0 70.1 9 1 male 73.0
## 470 107 69.0 66.0 70.1 9 2 male 72.0
## 471 107 69.0 66.0 70.1 9 3 male 69.0
## 472 107 69.0 66.0 70.1 9 4 male 69.0
## 473 107 69.0 66.0 70.1 9 5 female 66.5
## 474 107 69.0 66.0 70.1 9 6 female 65.5
## 475 107 69.0 66.0 70.1 9 7 female 65.5
## 476 107 69.0 66.0 70.1 9 8 female 65.0
## 477 107 69.0 66.0 70.1 9 9 female 64.0
## 478 108 69.0 65.0 69.6 7 1 male 70.0
## 479 108 69.0 65.0 69.6 7 2 male 68.5
## 480 108 69.0 65.0 69.6 7 3 male 67.0
## 481 108 69.0 65.0 69.6 7 4 female 65.0
## 482 108 69.0 65.0 69.6 7 5 female 64.0
## 483 108 69.0 65.0 69.6 7 6 female 63.5
## 484 108 69.0 65.0 69.6 7 7 female 61.0
## 485 109 69.5 64.5 69.6 7 1 male 69.7
## 486 109 69.5 64.5 69.6 7 2 male 68.0
## 487 109 69.5 64.5 69.6 7 3 male 60.0
## 488 109 69.5 64.5 69.6 7 4 female 65.2
## 489 109 69.5 64.5 69.6 7 5 female 64.5
## 490 109 69.5 64.5 69.6 7 6 female 63.7
## 491 109 69.5 64.5 69.6 7 7 female 60.0
## 492 110 69.2 64.0 69.2 4 1 male 71.7
## 493 110 69.2 64.0 69.2 4 2 male 66.5
## 494 110 69.2 64.0 69.2 4 3 female 65.0
## 495 110 69.2 64.0 69.2 4 4 female 63.5
## 496 111 69.0 63.5 68.8 1 1 female 65.5
## 497 112 69.0 63.0 68.5 3 1 male 69.0
## 498 112 69.0 63.0 68.5 3 2 female 67.5
## 499 112 69.0 63.0 68.5 3 3 female 63.5
## 500 113 69.0 63.0 68.5 1 1 male 72.0
## 501 114 69.0 63.0 68.5 6 1 male 73.0
## 502 114 69.0 63.0 68.5 6 2 male 70.0
## 503 114 69.0 63.0 68.5 6 3 male 70.0
## 504 114 69.0 63.0 68.5 6 4 male 64.0
## 505 114 69.0 63.0 68.5 6 5 female 66.0
## 506 114 69.0 63.0 68.5 6 6 female 62.0
## 507 115 69.0 63.5 68.8 7 1 male 70.5
## 508 115 69.0 63.5 68.8 7 2 male 67.0
## 509 115 69.0 63.5 68.8 7 3 male 66.0
## 510 115 69.0 63.5 68.8 7 4 female 65.0
## 511 115 69.0 63.5 68.8 7 5 female 63.0
## 512 115 69.0 63.5 68.8 7 6 female 62.0
## 513 115 69.0 63.5 68.8 7 7 female 61.0
## 514 116 69.0 63.5 68.8 3 1 male 70.5
## 515 116 69.0 63.5 68.8 3 2 female 63.7
## 516 116 69.0 63.5 68.8 3 3 female 63.0
## 517 117 69.7 62.0 68.3 1 1 female 62.5
## 518 118 69.5 62.0 68.2 3 1 male 73.0
## 519 118 69.5 62.0 68.2 3 2 male 72.0
## 520 118 69.5 62.0 68.2 3 3 male 69.0
## 521 119 69.0 62.0 68.0 5 1 male 73.0
## 522 119 69.0 62.0 68.0 5 2 male 71.0
## 523 119 69.0 62.0 68.0 5 3 male 71.0
## 524 119 69.0 62.0 68.0 5 4 male 69.0
## 525 119 69.0 62.0 68.0 5 5 female 63.0
## 526 120 69.5 62.0 68.2 11 1 male 72.0
## 527 120 69.5 62.0 68.2 11 2 male 70.0
## 528 120 69.5 62.0 68.2 11 3 male 67.8
## 529 120 69.5 62.0 68.2 11 4 female 65.2
## 530 120 69.5 62.0 68.2 11 5 female 64.7
## 531 120 69.5 62.0 68.2 11 6 female 64.5
## 532 120 69.5 62.0 68.2 11 7 female 63.5
## 533 120 69.5 62.0 68.2 11 8 female 63.5
## 534 120 69.5 62.0 68.2 11 9 female 62.5
## 535 120 69.5 62.0 68.2 11 10 female 62.0
## 536 120 69.5 62.0 68.2 11 11 female 61.5
## 537 121 69.0 62.5 68.2 8 1 male 71.0
## 538 121 69.0 62.5 68.2 8 2 male 70.0
## 539 121 69.0 62.5 68.2 8 3 male 70.0
## 540 121 69.0 62.5 68.2 8 4 male 69.0
## 541 121 69.0 62.5 68.2 8 5 female 63.5
## 542 121 69.0 62.5 68.2 8 6 female 62.5
## 543 121 69.0 62.5 68.2 8 7 female 62.5
## 544 121 69.0 62.5 68.2 8 8 female 62.0
## 545 122 69.0 62.0 68.0 4 1 male 72.0
## 546 122 69.0 62.0 68.0 4 2 male 68.0
## 547 122 69.0 62.0 68.0 4 3 female 66.0
## 548 122 69.0 62.0 68.0 4 4 female 66.0
## 549 123 69.5 61.0 67.7 5 1 male 70.0
## 550 123 69.5 61.0 67.7 5 2 male 69.5
## 551 123 69.5 61.0 67.7 5 3 male 69.0
## 552 123 69.5 61.0 67.7 5 4 female 63.0
## 553 123 69.5 61.0 67.7 5 5 female 62.0
## 554 124 69.0 61.0 67.4 9 1 male 68.0
## 555 124 69.0 61.0 67.4 9 2 male 68.0
## 556 124 69.0 61.0 67.4 9 3 male 67.5
## 557 124 69.0 61.0 67.4 9 4 male 64.0
## 558 124 69.0 61.0 67.4 9 5 male 63.0
## 559 124 69.0 61.0 67.4 9 6 male 63.0
## 560 124 69.0 61.0 67.4 9 7 female 63.5
## 561 124 69.0 61.0 67.4 9 8 female 62.0
## 562 124 69.0 61.0 67.4 9 9 female 62.0
## 563 125 69.0 60.0 66.9 3 1 male 70.5
## 564 125 69.0 60.0 66.9 3 2 female 68.0
## 565 125 69.0 60.0 66.9 3 3 female 62.5
## 566 126 69.0 60.0 66.9 4 1 male 69.0
## 567 126 69.0 60.0 66.9 4 2 male 66.0
## 568 126 69.0 60.0 66.9 4 3 female 61.7
## 569 126 69.0 60.0 66.9 4 4 female 60.5
## 570 127 69.0 60.5 67.2 1 1 male 69.5
## 571 128 68.7 70.5 72.4 2 1 male 71.0
## 572 128 68.7 70.5 72.4 2 2 female 61.7
## 573 129 68.5 67.0 70.4 3 1 male 73.0
## 574 129 68.5 67.0 70.4 3 2 male 71.0
## 575 129 68.5 67.0 70.4 3 3 female 67.0
## 576 130 68.5 66.5 70.2 11 1 male 70.0
## 577 130 68.5 66.5 70.2 11 2 male 69.0
## 578 130 68.5 66.5 70.2 11 3 male 69.0
## 579 130 68.5 66.5 70.2 11 4 male 68.7
## 580 130 68.5 66.5 70.2 11 5 male 68.5
## 581 130 68.5 66.5 70.2 11 6 male 68.5
## 582 130 68.5 66.5 70.2 11 7 male 68.0
## 583 130 68.5 66.5 70.2 11 8 male 68.0
## 584 130 68.5 66.5 70.2 11 9 male 68.0
## 585 130 68.5 66.5 70.2 11 10 male 66.2
## 586 130 68.5 66.5 70.2 11 11 female 63.2
## 587 131 68.0 65.0 69.1 2 1 male 67.5
## 588 131 68.0 65.0 69.1 2 2 male 66.0
## 589 132 68.0 65.5 69.4 2 1 male 66.0
## 590 132 68.0 65.5 69.4 2 2 female 64.0
## 591 133 68.0 65.5 69.4 7 1 male 71.7
## 592 133 68.0 65.5 69.4 7 2 male 71.5
## 593 133 68.0 65.5 69.4 7 3 male 70.7
## 594 133 68.0 65.5 69.4 7 4 male 65.5
## 595 133 68.0 65.5 69.4 7 5 female 66.5
## 596 133 68.0 65.5 69.4 7 6 female 65.2
## 597 133 68.0 65.5 69.4 7 7 female 61.5
## 598 134 68.0 65.0 69.1 4 1 male 72.0
## 599 134 68.0 65.0 69.1 4 2 male 72.0
## 600 134 68.0 65.0 69.1 4 3 female 68.0
## 601 134 68.0 65.0 69.1 4 4 female 66.0
## 602 135 68.5 65.0 69.3 8 1 male 69.2
## 603 135 68.5 65.0 69.3 8 2 male 68.0
## 604 135 68.5 65.0 69.3 8 3 male 66.0
## 605 135 68.5 65.0 69.3 8 4 male 66.0
## 606 135 68.5 65.0 69.3 8 5 female 62.0
## 607 135 68.5 65.0 69.3 8 6 female 61.5
## 608 135 68.5 65.0 69.3 8 7 female 61.0
## 609 135 68.5 65.0 69.3 8 8 female 60.0
## 610 136A 68.5 65.0 69.3 8 1 male 72.0
## 611 136A 68.5 65.0 69.3 8 2 male 70.5
## 612 136A 68.5 65.0 69.3 8 3 male 68.7
## 613 136A 68.5 65.0 69.3 8 4 male 68.5
## 614 136A 68.5 65.0 69.3 8 5 male 67.7
## 615 136A 68.5 65.0 69.3 8 6 female 64.0
## 616 136A 68.5 65.0 69.3 8 7 female 63.5
## 617 136A 68.5 65.0 69.3 8 8 female 63.0
## 618 136 68.0 64.0 68.6 10 1 male 71.0
## 619 136 68.0 64.0 68.6 10 2 male 68.0
## 620 136 68.0 64.0 68.6 10 3 male 68.0
## 621 136 68.0 64.0 68.6 10 4 male 67.0
## 622 136 68.0 64.0 68.6 10 5 female 65.0
## 623 136 68.0 64.0 68.6 10 6 female 64.0
## 624 136 68.0 64.0 68.6 10 7 female 63.0
## 625 136 68.0 64.0 68.6 10 8 female 63.0
## 626 136 68.0 64.0 68.6 10 9 female 62.0
## 627 136 68.0 64.0 68.6 10 10 female 61.0
## 628 137 68.0 64.0 68.6 4 1 male 66.0
## 629 137 68.0 64.0 68.6 4 2 male 63.0
## 630 137 68.0 64.0 68.6 4 3 female 65.5
## 631 137 68.0 64.0 68.6 4 4 female 62.0
## 632 138 68.0 64.0 68.6 5 1 male 71.2
## 633 138 68.0 64.0 68.6 5 2 male 71.2
## 634 138 68.0 64.0 68.6 5 3 male 69.0
## 635 138 68.0 64.0 68.6 5 4 male 68.5
## 636 138 68.0 64.0 68.6 5 5 female 62.5
## 637 139 68.0 64.5 68.8 1 1 female 62.0
## 638 140 68.0 64.0 68.6 10 1 male 69.0
## 639 140 68.0 64.0 68.6 10 2 male 67.0
## 640 140 68.0 64.0 68.6 10 3 male 66.0
## 641 140 68.0 64.0 68.6 10 4 female 66.0
## 642 140 68.0 64.0 68.6 10 5 female 66.0
## 643 140 68.0 64.0 68.6 10 6 female 65.0
## 644 140 68.0 64.0 68.6 10 7 female 65.0
## 645 140 68.0 64.0 68.6 10 8 female 65.0
## 646 140 68.0 64.0 68.6 10 9 female 64.0
## 647 140 68.0 64.0 68.6 10 10 female 63.0
## 648 141 68.0 63.0 68.0 8 1 male 70.5
## 649 141 68.0 63.0 68.0 8 2 male 70.0
## 650 141 68.0 63.0 68.0 8 3 male 68.0
## 651 141 68.0 63.0 68.0 8 4 male 66.0
## 652 141 68.0 63.0 68.0 8 5 male 66.0
## 653 141 68.0 63.0 68.0 8 6 female 66.0
## 654 141 68.0 63.0 68.0 8 7 female 62.0
## 655 141 68.0 63.0 68.0 8 8 female 61.5
## 656 142 68.5 63.5 68.5 4 1 male 73.5
## 657 142 68.5 63.5 68.5 4 2 male 70.0
## 658 142 68.5 63.5 68.5 4 3 male 69.5
## 659 142 68.5 63.5 68.5 4 4 female 65.5
## 660 143 68.0 63.0 68.0 1 1 male 67.0
## 661 144 68.0 63.0 68.0 4 1 male 70.0
## 662 144 68.0 63.0 68.0 4 2 male 68.0
## 663 144 68.0 63.0 68.0 4 3 female 64.5
## 664 144 68.0 63.0 68.0 4 4 female 64.0
## 665 145 68.0 63.0 68.0 8 1 male 71.0
## 666 145 68.0 63.0 68.0 8 2 male 68.0
## 667 145 68.0 63.0 68.0 8 3 male 66.0
## 668 145 68.0 63.0 68.0 8 4 male 65.5
## 669 145 68.0 63.0 68.0 8 5 male 65.0
## 670 145 68.0 63.0 68.0 8 6 female 63.0
## 671 145 68.0 63.0 68.0 8 7 female 62.0
## 672 145 68.0 63.0 68.0 8 8 female 62.0
## 673 146 68.0 63.0 68.0 6 1 male 67.0
## 674 146 68.0 63.0 68.0 6 2 male 67.0
## 675 146 68.0 63.0 68.0 6 3 male 66.0
## 676 146 68.0 63.0 68.0 6 4 female 64.0
## 677 146 68.0 63.0 68.0 6 5 female 63.5
## 678 146 68.0 63.0 68.0 6 6 female 61.0
## 679 147 68.5 63.5 68.5 1 1 male 68.2
## 680 148 68.0 63.0 68.0 1 1 male 70.0
## 681 149 68.2 63.5 68.4 5 1 male 70.0
## 682 149 68.2 63.5 68.4 5 2 male 69.0
## 683 149 68.2 63.5 68.4 5 3 male 67.0
## 684 149 68.2 63.5 68.4 5 4 male 65.5
## 685 149 68.2 63.5 68.4 5 5 female 64.5
## 686 150 68.0 62.5 67.8 1 1 male 68.5
## 687 151 68.7 62.0 67.8 2 1 male 67.7
## 688 151 68.7 62.0 67.8 2 2 female 61.7
## 689 152 68.0 62.5 67.8 1 1 male 66.5
## 690 153 68.0 61.0 66.9 5 1 male 68.5
## 691 153 68.0 61.0 66.9 5 2 male 68.0
## 692 153 68.0 61.0 66.9 5 3 male 64.0
## 693 153 68.0 61.0 66.9 5 4 female 63.5
## 694 153 68.0 61.0 66.9 5 5 female 63.0
## 695 154 68.0 60.2 66.5 1 1 male 66.7
## 696 155 68.0 60.0 66.4 7 1 male 64.0
## 697 155 68.0 60.0 66.4 7 2 female 61.0
## 698 155 68.0 60.0 66.4 7 3 female 61.0
## 699 155 68.0 60.0 66.4 7 4 female 60.0
## 700 155 68.0 60.0 66.4 7 5 female 60.0
## 701 155 68.0 60.0 66.4 7 6 female 60.0
## 702 155 68.0 60.0 66.4 7 7 female 56.0
## 703 156 68.0 60.0 66.4 4 1 male 67.5
## 704 156 68.0 60.0 66.4 4 2 male 67.0
## 705 156 68.0 60.0 66.4 4 3 male 66.5
## 706 156 68.0 60.0 66.4 4 4 female 60.0
## 707 157 68.5 59.0 66.1 1 1 male 69.0
## 708 158 68.0 59.0 65.9 10 1 male 68.0
## 709 158 68.0 59.0 65.9 10 2 male 65.0
## 710 158 68.0 59.0 65.9 10 3 male 64.7
## 711 158 68.0 59.0 65.9 10 4 male 64.0
## 712 158 68.0 59.0 65.9 10 5 male 64.0
## 713 158 68.0 59.0 65.9 10 6 male 63.0
## 714 158 68.0 59.0 65.9 10 7 female 65.0
## 715 158 68.0 59.0 65.9 10 8 female 65.0
## 716 158 68.0 59.0 65.9 10 9 female 62.0
## 717 158 68.0 59.0 65.9 10 10 female 61.0
## 718 159 67.0 66.2 69.2 5 1 male 72.7
## 719 159 67.0 66.2 69.2 5 2 male 72.7
## 720 159 67.0 66.2 69.2 5 3 male 71.5
## 721 159 67.0 66.2 69.2 5 4 female 65.5
## 722 159 67.0 66.2 69.2 5 5 female 63.5
## 723 160 67.0 66.5 69.4 1 1 male 71.0
## 724 161 67.0 66.0 69.1 8 1 male 73.0
## 725 161 67.0 66.0 69.1 8 2 male 71.0
## 726 161 67.0 66.0 69.1 8 3 male 70.7
## 727 161 67.0 66.0 69.1 8 4 male 70.0
## 728 161 67.0 66.0 69.1 8 5 male 69.0
## 729 161 67.0 66.0 69.1 8 6 female 68.0
## 730 161 67.0 66.0 69.1 8 7 female 65.5
## 731 161 67.0 66.0 69.1 8 8 female 62.0
## 732 162 67.0 65.0 68.6 6 1 male 69.7
## 733 162 67.0 65.0 68.6 6 2 male 67.5
## 734 162 67.0 65.0 68.6 6 3 female 65.5
## 735 162 67.0 65.0 68.6 6 4 female 65.0
## 736 162 67.0 65.0 68.6 6 5 female 64.5
## 737 162 67.0 65.0 68.6 6 6 female 63.5
## 738 163 67.0 65.5 68.9 5 1 male 70.0
## 739 163 67.0 65.5 68.9 5 2 male 69.0
## 740 163 67.0 65.5 68.9 5 3 female 65.5
## 741 163 67.0 65.5 68.9 5 4 female 65.5
## 742 163 67.0 65.5 68.9 5 5 female 63.0
## 743 164 67.0 65.5 68.9 4 1 male 70.0
## 744 164 67.0 65.5 68.9 4 2 male 67.7
## 745 164 67.0 65.5 68.9 4 3 female 63.0
## 746 164 67.0 65.5 68.9 4 4 female 60.0
## 747 165 67.0 65.0 68.6 3 1 male 65.0
## 748 165 67.0 65.0 68.6 3 2 female 62.0
## 749 165 67.0 65.0 68.6 3 3 female 62.0
## 750 166 67.5 65.0 68.8 11 1 male 71.0
## 751 166 67.5 65.0 68.8 11 2 male 69.0
## 752 166 67.5 65.0 68.8 11 3 female 64.0
## 753 166 67.5 65.0 68.8 11 4 female 64.0
## 754 166 67.5 65.0 68.8 11 5 female 63.0
## 755 166 67.5 65.0 68.8 11 6 female 63.0
## 756 166 67.5 65.0 68.8 11 7 female 63.0
## 757 166 67.5 65.0 68.8 11 8 female 63.0
## 758 166 67.5 65.0 68.8 11 9 female 63.0
## 759 166 67.5 65.0 68.8 11 10 female 62.5
## 760 166 67.5 65.0 68.8 11 11 female 62.0
## 761 167 67.0 64.0 68.1 4 1 male 71.5
## 762 167 67.0 64.0 68.1 4 2 male 70.0
## 763 167 67.0 64.0 68.1 4 3 male 67.0
## 764 167 67.0 64.0 68.1 4 4 male 67.0
## 765 168 67.0 63.5 67.8 8 1 male 71.0
## 766 168 67.0 63.5 67.8 8 2 male 70.2
## 767 168 67.0 63.5 67.8 8 3 male 69.2
## 768 168 67.0 63.5 67.8 8 4 male 68.5
## 769 168 67.0 63.5 67.8 8 5 male 68.0
## 770 168 67.0 63.5 67.8 8 6 male 67.0
## 771 168 67.0 63.5 67.8 8 7 male 65.5
## 772 168 67.0 63.5 67.8 8 8 female 63.5
## 773 169 67.0 63.0 67.5 3 1 male 69.0
## 774 169 67.0 63.0 67.5 3 2 male 68.0
## 775 169 67.0 63.0 67.5 3 3 female 63.0
## 776 170 67.5 62.0 67.2 5 1 male 70.0
## 777 170 67.5 62.0 67.2 5 2 male 69.5
## 778 170 67.5 62.0 67.2 5 3 male 69.0
## 779 170 67.5 62.0 67.2 5 4 male 68.5
## 780 170 67.5 62.0 67.2 5 5 female 66.0
## 781 171 67.0 61.0 66.4 1 1 male 67.0
## 782 172 66.0 67.0 69.2 8 1 male 70.5
## 783 172 66.0 67.0 69.2 8 2 male 70.5
## 784 172 66.0 67.0 69.2 8 3 male 67.0
## 785 172 66.0 67.0 69.2 8 4 male 66.0
## 786 172 66.0 67.0 69.2 8 5 male 66.0
## 787 172 66.0 67.0 69.2 8 6 female 62.0
## 788 172 66.0 67.0 69.2 8 7 female 62.0
## 789 172 66.0 67.0 69.2 8 8 female 61.5
## 790 173 66.0 67.0 69.2 9 1 male 72.0
## 791 173 66.0 67.0 69.2 9 2 male 65.0
## 792 173 66.0 67.0 69.2 9 3 male 65.0
## 793 173 66.0 67.0 69.2 9 4 female 67.0
## 794 173 66.0 67.0 69.2 9 5 female 64.0
## 795 173 66.0 67.0 69.2 9 6 female 64.0
## 796 173 66.0 67.0 69.2 9 7 female 62.0
## 797 173 66.0 67.0 69.2 9 8 female 60.0
## 798 173 66.0 67.0 69.2 9 9 female 60.0
## 799 174 66.0 66.0 68.6 5 1 male 66.0
## 800 174 66.0 66.0 68.6 5 2 male 65.0
## 801 174 66.0 66.0 68.6 5 3 female 67.0
## 802 174 66.0 66.0 68.6 5 4 female 66.5
## 803 174 66.0 66.0 68.6 5 5 female 65.5
## 804 175 66.0 66.0 68.6 6 1 male 72.0
## 805 175 66.0 66.0 68.6 6 2 male 68.0
## 806 175 66.0 66.0 68.6 6 3 female 66.0
## 807 175 66.0 66.0 68.6 6 4 female 65.0
## 808 175 66.0 66.0 68.6 6 5 female 62.0
## 809 175 66.0 66.0 68.6 6 6 female 61.0
## 810 176 66.5 65.0 68.3 8 1 male 68.7
## 811 176 66.5 65.0 68.3 8 2 male 68.5
## 812 176 66.5 65.0 68.3 8 3 male 66.5
## 813 176 66.5 65.0 68.3 8 4 male 64.5
## 814 176 66.5 65.0 68.3 8 5 female 62.5
## 815 176 66.5 65.0 68.3 8 6 female 60.5
## 816 176 66.5 65.0 68.3 8 7 female 60.5
## 817 176 66.5 65.0 68.3 8 8 female 57.5
## 818 177 66.0 65.5 68.4 5 1 male 72.0
## 819 177 66.0 65.5 68.4 5 2 male 71.0
## 820 177 66.0 65.5 68.4 5 3 male 67.0
## 821 177 66.0 65.5 68.4 5 4 female 66.0
## 822 177 66.0 65.5 68.4 5 5 female 65.0
## 823 178 66.0 63.0 67.0 1 1 male 70.0
## 824 179 66.0 63.5 67.3 2 1 female 64.5
## 825 179 66.0 63.5 67.3 2 2 female 62.0
## 826 180 66.5 63.0 67.3 6 1 male 67.2
## 827 180 66.5 63.0 67.3 6 2 male 67.0
## 828 180 66.5 63.0 67.3 6 3 male 65.0
## 829 180 66.5 63.0 67.3 6 4 female 65.0
## 830 180 66.5 63.0 67.3 6 5 female 65.0
## 831 180 66.5 63.0 67.3 6 6 female 63.0
## 832 181 66.5 62.5 67.0 7 1 male 70.0
## 833 181 66.5 62.5 67.0 7 2 male 68.0
## 834 181 66.5 62.5 67.0 7 3 female 63.5
## 835 181 66.5 62.5 67.0 7 4 female 62.5
## 836 181 66.5 62.5 67.0 7 5 female 62.5
## 837 181 66.5 62.5 67.0 7 6 female 62.5
## 838 181 66.5 62.5 67.0 7 7 female 62.5
## 839 182 66.0 61.5 66.2 1 1 male 70.0
## 840 183 66.0 60.0 65.4 4 1 male 68.0
## 841 183 66.0 60.0 65.4 4 2 male 67.0
## 842 183 66.0 60.0 65.4 4 3 male 65.0
## 843 183 66.0 60.0 65.4 4 4 female 60.0
## 844 184 66.0 60.0 65.4 1 1 male 65.0
## 845 185 66.0 59.0 64.9 15 1 male 68.0
## 846 185 66.0 59.0 64.9 15 2 male 67.0
## 847 185 66.0 59.0 64.9 15 3 male 66.5
## 848 185 66.0 59.0 64.9 15 4 male 66.0
## 849 185 66.0 59.0 64.9 15 5 male 65.7
## 850 185 66.0 59.0 64.9 15 6 male 65.5
## 851 185 66.0 59.0 64.9 15 7 male 65.0
## 852 185 66.0 59.0 64.9 15 8 female 65.0
## 853 185 66.0 59.0 64.9 15 9 female 64.0
## 854 185 66.0 59.0 64.9 15 10 female 63.0
## 855 185 66.0 59.0 64.9 15 11 female 62.0
## 856 185 66.0 59.0 64.9 15 12 female 61.0
## 857 185 66.0 59.0 64.9 15 13 female 60.0
## 858 185 66.0 59.0 64.9 15 14 female 58.0
## 859 185 66.0 59.0 64.9 15 15 female 57.0
## 860 186 65.0 67.0 68.7 4 1 male 66.5
## 861 186 65.0 67.0 68.7 4 2 male 66.0
## 862 186 65.0 67.0 68.7 4 3 male 66.0
## 863 186 65.0 67.0 68.7 4 4 female 65.0
## 864 187 65.0 67.0 68.7 1 1 female 63.0
## 865 188 65.0 66.0 68.1 4 1 male 63.0
## 866 188 65.0 66.0 68.1 4 2 female 63.0
## 867 188 65.0 66.0 68.1 4 3 female 63.0
## 868 188 65.0 66.0 68.1 4 4 female 60.0
## 869 189 65.0 66.0 68.1 5 1 male 67.0
## 870 189 65.0 66.0 68.1 5 2 male 66.0
## 871 189 65.0 66.0 68.1 5 3 male 65.0
## 872 189 65.0 66.0 68.1 5 4 female 65.0
## 873 189 65.0 66.0 68.1 5 5 female 61.0
## 874 190 65.0 65.0 67.6 9 1 male 69.0
## 875 190 65.0 65.0 67.6 9 2 male 68.0
## 876 190 65.0 65.0 67.6 9 3 male 68.0
## 877 190 65.0 65.0 67.6 9 4 female 65.0
## 878 190 65.0 65.0 67.6 9 5 female 65.0
## 879 190 65.0 65.0 67.6 9 6 female 62.0
## 880 190 65.0 65.0 67.6 9 7 female 62.0
## 881 190 65.0 65.0 67.6 9 8 female 61.0
## 882 190 65.0 65.0 67.6 9 9 female 59.0
## 883 191 65.0 65.5 67.9 2 1 male 70.7
## 884 191 65.0 65.5 67.9 2 2 female 65.5
## 885 192 65.0 65.0 67.6 6 1 male 69.2
## 886 192 65.0 65.0 67.6 6 2 male 69.0
## 887 192 65.0 65.0 67.6 6 3 male 68.0
## 888 192 65.0 65.0 67.6 6 4 male 67.7
## 889 192 65.0 65.0 67.6 6 5 female 64.5
## 890 192 65.0 65.0 67.6 6 6 female 60.5
## 891 193 65.0 64.0 67.1 6 1 male 67.0
## 892 193 65.0 64.0 67.1 6 2 male 67.0
## 893 193 65.0 64.0 67.1 6 3 female 64.0
## 894 193 65.0 64.0 67.1 6 4 female 64.0
## 895 193 65.0 64.0 67.1 6 5 female 62.5
## 896 193 65.0 64.0 67.1 6 6 female 60.5
## 897 194 65.0 63.0 66.5 2 1 male 70.0
## 898 194 65.0 63.0 66.5 2 2 female 63.0
## 899 195 65.0 63.0 66.5 3 1 male 66.0
## 900 195 65.0 63.0 66.5 3 2 male 66.0
## 901 195 65.0 63.0 66.5 3 3 female 63.0
## 902 196 65.5 63.0 66.8 4 1 male 71.0
## 903 196 65.5 63.0 66.8 4 2 male 71.0
## 904 196 65.5 63.0 66.8 4 3 male 69.0
## 905 196 65.5 63.0 66.8 4 4 female 63.5
## 906 197 65.5 60.0 65.2 5 1 male 68.0
## 907 197 65.5 60.0 65.2 5 2 male 68.0
## 908 197 65.5 60.0 65.2 5 3 male 67.0
## 909 197 65.5 60.0 65.2 5 4 male 67.0
## 910 197 65.5 60.0 65.2 5 5 female 62.0
## 911 198 64.0 64.0 66.6 7 1 male 71.5
## 912 198 64.0 64.0 66.6 7 2 male 68.0
## 913 198 64.0 64.0 66.6 7 3 female 65.5
## 914 198 64.0 64.0 66.6 7 4 female 64.0
## 915 198 64.0 64.0 66.6 7 5 female 62.0
## 916 198 64.0 64.0 66.6 7 6 female 62.0
## 917 198 64.0 64.0 66.6 7 7 female 61.0
## 918 199 64.0 64.0 66.6 7 1 male 70.5
## 919 199 64.0 64.0 66.6 7 2 male 68.0
## 920 199 64.0 64.0 66.6 7 3 female 67.0
## 921 199 64.0 64.0 66.6 7 4 female 65.0
## 922 199 64.0 64.0 66.6 7 5 female 64.0
## 923 199 64.0 64.0 66.6 7 6 female 64.0
## 924 199 64.0 64.0 66.6 7 7 female 60.0
## 925 200 64.0 63.0 66.0 1 1 male 64.5
## 926 201 64.0 60.0 64.4 2 1 male 66.0
## 927 201 64.0 60.0 64.4 2 2 female 60.0
## 928 202 63.0 63.5 65.8 2 1 female 68.5
## 929 202 63.0 63.5 65.8 2 2 female 63.5
## 930 203 62.0 66.0 66.6 3 1 male 64.0
## 931 203 62.0 66.0 66.6 3 2 female 62.0
## 932 203 62.0 66.0 66.6 3 3 female 61.0
## 933 204 62.5 63.0 65.3 2 1 male 66.5
## 934 204 62.5 63.0 65.3 2 2 female 57.0
GaltonFamilies %>%
filter(childNum == 1 & gender == "male") %>%
select(father, childHeight) %>%
rename(son = childHeight) %>%
do(tidy(lm(father ~ son, data = .)))
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 34.0 4.57 7.44 4.31e-12
## 2 son 0.499 0.0648 7.70 9.47e-13
how about we fit a following model with x representing father height and y representing son height we do get a statistically significant result.png
[][Confounders are perhaps the most common reason that leads to associations being misinterpreted]. If X and Y are correlated, we call Z a confounder if changes in Z cause changes in both X and Y. Earlier, when studying baseball data, we saw how home runs was a confounder that resulted in higher correlation than expected when studying the relationship between bases and balls and runs. In some cases, we can use linear models to account for confounders, as we did in the baseball example. But it is not always possible. Incorrect interpretation due to confounders are ubiquitous in the lay press. They are sometimes hard to detect.
We examined admission data from UC Berkeley majors from 1973 that showed that more men were being admitted than women. 44% men were admitted compared to 30% women (I question this way of calculation, ). Here’s the data. The percent of men and women that got accepted can be computed from this data using this simple piece of code. A statistical test, the [][chi-squared test] that we learned about in a previous course, clearly rejects the hypothesis that gender and admissions are independent. The p value is very small.
But closer inspection shows a paradoxical result. Here are the percent of admissions by major. Four out of the six majors favor women. But more importantly, all the differences are much smaller than the 14% difference that we see when examining the totals. The paradox is that analyzing the totals suggest a dependence between admissions and gender. But when the data is grouped by major, this dependence seems to disappear (you are only considering the absolute value of admitted, that the UC Berkeley demand, not the admission rate of student). What’s going on? This actually can happen if an uncounted confounder is driving most of the variability.
So let’s define three variables. X is 1 for men and 0 for women, Y is 1 for those admitted and 0 otherwise, and Z quantifies how selective the major is. A gender bias claim would be based on the fact that this probability is higher when X is 1 than when X is 0. But Z is an important confounder. Clearly, Z is associated with Y, because the more selective a major, the lower the probability that someone enters that major. But is major selectivity, which we call Z, associated with gender? One way to see this is to plot the total percent admitted to a major versus the percent of women that make up the applicants. We can see that in this plot. There seems to be an association.
The plot suggests that women were much more likely to apply to the two hard majors. Gender and major selectivity are confounded. Compare, for example, major B and major E. Major E is much harder to enter than major B. And over 60% of applicants to major E were women, while less than 30% of the applicants of major B were women. The following plot shows the percentage of applicants that were accepted by gender. The color here represents major. It also breaks down the acceptance rates by major. The size of the colored bar represent the percent of each major students that were admitted to. This breakdown lets us see that the majority of accepted men came from two majors, A and B. It also lets us see that few women apply to these two easy majors.
What the plot does not show us is what’s the percent admitted by major. In this plot, we can see that if we condition or stratify by major, and then look at differences, we control for the confounder, and the effect goes away. Now we see that major by major, there’s not much difference. The size of the dot represents the number of applicants, and explains the paradox. We see large red dots and small blue dots for the easiest majors, A and B. If we first stratify by major, compute the difference, and then average, we find that the percent difference is actually quite small. This is actually an example of something that is called Simpson’s paradox, which we will describe in the next video.
Textbook link
This video corresponds to the textbook section on confounders. https://rafalab.github.io/dsbook/association-is-not-causation.html#confounders
Key points
If X and Y are correlated, we call Z a confounder if changes in Z causes changes in both X and Y.
Code
library(dslabs) data(admissions) admissions
admissions %>% group_by(gender) %>% summarize(percentage = round(sum(admitted*applicants)/sum(applicants),1))
admissions %>% group_by(gender) %>% summarize(total_admitted = round(sum(admitted / 100 * applicants)), not_admitted = sum(applicants) - sum(total_admitted)) %>% select(-gender) %>% do(tidy(chisq.test(.)))
admissions %>% select(major, gender, admitted) %>% spread(gender, admitted) %>% mutate(women_minus_men = women - men)
admissions %>% group_by(major) %>% summarize(major_selectivity = sum(admitted * applicants) / sum(applicants), percent_women_applicants = sum(applicants * (gender==“women”)) / sum(applicants) * 100) %>% ggplot(aes(major_selectivity, percent_women_applicants, label = major)) + geom_text()
admissions %>% mutate(yes = round(admitted/100*applicants), no = applicants - yes) %>% select(-applicants, -admitted) %>% gather(admission, number_of_students, -c(“major”, “gender”)) %>% ggplot(aes(gender, number_of_students, fill = admission)) + geom_bar(stat = “identity”, position = “stack”) + facet_wrap(. ~ major)
admissions %>% mutate(percent_admitted = admitted * applicants/sum(applicants)) %>% ggplot(aes(gender, y = percent_admitted, fill = major)) + geom_bar(stat = “identity”, position = “stack”)
admissions %>% ggplot(aes(major, admitted, col = gender, size = applicants)) + geom_point()
admissions %>% group_by(gender) %>% summarize(average = mean(admitted))
library(tidyverse)
library(dbplyr)
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
library(broom)
# UC-Berkeley admission data
library(dslabs)
data(admissions)
admissions
## major gender admitted applicants
## 1 A men 62 825
## 2 B men 63 560
## 3 C men 37 325
## 4 D men 33 417
## 5 E men 28 191
## 6 F men 6 373
## 7 A women 82 108
## 8 B women 68 25
## 9 C women 34 593
## 10 D women 35 375
## 11 E women 24 393
## 12 F women 7 341
admissions %>%
group_by(gender) %>%
#filter(gender=='men') %>% # 44.51951
summarize(percentage = sum(admitted*applicants)/sum(applicants))
## # A tibble: 2 × 2
## gender percentage
## <chr> <dbl>
## 1 men 44.5
## 2 women 30.3
#=========================== Why we are multiple applicants here ??????
admissions %>%
group_by(gender) %>%
summarize(total_admitted = sum(admitted/applicants*100),
not_admitted = sum(applicants) - sum(total_admitted)) %>%
select(-gender) %>%
do(tidy(chisq.test(.))) # What is this test ??
## # A tibble: 1 × 4
## statistic p.value parameter method
## <dbl> <dbl> <int> <chr>
## 1 423. 6.91e-94 1 Pearson's Chi-squared test with Yates' continuit…
# ========================================================
(6+28+33+37+63+62)/(825+560+325+417+191+373) # men$admitted/men$applicants
## [1] 0.0851
(6+28+33+37+63+62)/(6+28+33+37+63+62+82+68+34+35+24+7) # men$admitted/total$admitted
## [1] 0.478
(82+68+34+35+24+7)/(6+28+33+37+63+62+82+68+34+35+24+7) # women$admitted/total$admitted
## [1] 0.522
sum(admissions$admitted[admissions$gender=='women']) /
sum(admissions$applicants[admissions$gender=='women'])
## [1] 0.136
sum(admissions$admitted[admissions$gender=='men']) /
sum(admissions$applicants[admissions$gender=='men'])
## [1] 0.0851
sum(admissions$admitted[admissions$gender=='women'])/sum(admissions$admitted)
## [1] 0.522
admissions %>%
select(major, gender, admitted) %>%
spread(gender, admitted) %>%
mutate(women_minus_men = women - men)
## major men women women_minus_men
## 1 A 62 82 20
## 2 B 63 68 5
## 3 C 37 34 -3
## 4 D 33 35 2
## 5 E 28 24 -4
## 6 F 6 7 1
library(ggplot2)
admissions %>%
group_by(major) %>%
summarize(major_selectivity = sum(admitted*applicants)/sum(applicants),
percent_women_applicants = sum(applicants*(gender=='women')/sum(applicants))*100) %>%
# Why we are doing this way? why we are using summarize function?
ggplot(aes(major_selectivity, percent_women_applicants, label=major)) +
geom_text()
# The plot suggests that women were much more likely to apply to the two
hard majors.
gender bias claim would be based on the fact that this probability is higher when X is women
so whats is the major selectivity
I don’t understand here, why we are doing it this way. The total percent admitted to each major and percent of women make up the applicants??? And the equation used to calculate major_selectivity
We have just seen an example of Simpson’s paradox. It is called a paradox because we see the sign of the correlation flip when we computed on the entire population and when we computed on specific strata. Now, we’re going to use a very illustrative simulated example to show you how this can happen. Suppose you have three variables– x, y, and z. Here’s a scatter plot of y versus x. You can see that x and y are negatively correlated. However, once we stratify by z–the confounder which we haven’t looked at yet–we see that another pattern emerges.
This plot, the different strata defined by disease, are shown in different colors. If you compute the correlation in each strata, you see that the correlations are now positive. So it’s really z that is negatively correlated with x. If we stratify by z, the x and y are actually positively correlated. This is an example of Simpson’s paradox.
Textbook link
This video corresponds to the textbook section on Simpson’s paradox. https://rafalab.github.io/dsbook/association-is-not-causation.html#simpsons-paradox
Key point
[][* Simpson’s Paradox happens when we see the sign of the correlation flip when comparing the entire dataset with specific strata. *]
So it’s really z that is negatively correlated with x.
Comprehension Check due Jun 26, 2022 00:29 AWST
0/1 point (graded)
In the videos, we ran one million tests of correlation for two random variables, X and Y. How many of these correlations would you expect to have a significant p-value ( ), just by chance? 5,000 50,000 100,000 It’s impossible to know incorrect Answer # =================================================================================================================== [][Incorrect: Try again - we can in fact estimate the number of significant correlations we see by chance. Remember that the p-value is defined as the probability of finding the observed result when the null hypothesis (no correlation) is true. When we have a p-value of 0.05, this means the chance of finding a correlation when none exists is 5% - e.g., 0.051,000,000 chances.*] You have used 1 of 2 attempts Some
0.75/1 point (graded) Which of the following are examples of p-hacking?
Select ALL that apply. Looking for associations between an outcome and several exposures and only reporting the one that is significant. Trying several different models and selecting the one that yields the smallest p-value. Repeating an experiment multiple times and only reporting the one with the smallest p-value. Using a Monte Carlo simulations in an analysis. partially correct Answer Incorrect: Correct, this is one of the three examples of multiple testing included in the video. Correct, this is one of the three examples of multiple testing included in the video. You have used 1 of 2 attempts Some
1/1 point (graded) The Spearman correlation coefficient is robust to outliers because: It drops outliers before calculating correlation. It is the correlation of standardized values. It calculates correlation between ranks, not values. correct You have used 1 of 1 attempt Some
1/1 point (graded) What can you do to determine if you are misinterpreting results because of a confounder? Nothing. If the p-value says the result is significant, then it is. More closely examine the results by stratifying and plotting the data. Always assume that you are misinterpreting the results. Use linear models to tease out a confounder. correct Answer Correct: Correct. Although you can sometimes use linear models, you can’t always and exploratory data analysis (stratifying and plotting data) will help determine if there is a confounder. You have used 1 of 2 attempts Some
1/1 point (graded)
Look again at the admissions data presented in the confounders video using ?admissions. What important characteristic of the table variables do you need to know to understand the calculations used in this video? The data are from 1973. The columns major and gender are of class character, while admitted and applicants are numeric. The data are from the dslabs package. The column admitted is the percent of students admitted, while the column applicants is the total number of applicants. correct Answer Correct: Correct. In all data science projects, it is important to understand the data that you are working with.
Explanation
[][Several of these statements are true but not relevant to understanding the calculations in the video. The only statement that is critical for the analysis is that “The column admitted is the percent of students admitted, while the column applicants is the total number of applicants.” In all data science projects, it is important to understand the data that you are working with.] You have used 1 of 2 attempts Some
1/1 point (graded) In the example in the confounders video, major selectivity confounds the relationship between UC Berkeley admission rates and gender because: It was harder for women to be admitted to UC Berkeley. Major selectivity is associated with both admission rates and with gender, as women tended to apply to more selective majors. Some majors are more selective than others. Major selectivity is not a confounder. correct
Explanation
Major selectivity is a confounder because it is associated with both admission rate and with gender. You have used 1 of 2 attempts Some
0/1 point (graded) Admission rates at UC Berkeley are an example of Simpson’s Paradox because: It appears that men have higher a higher admission rate than women, however, after we stratify by major, we see that on average women have a higher admission rate than men. correct It was a paradox that women were being admitted at a lower rate than men. The relationship between admissions and gender is confounded by major selectivity. incorrect Answer Incorrect: Incorrect. This is true, but, Simpson’s Paradox refers specifically to cases where the sign of the correlation flips when comparing the entire dataset vs. specific strata.
Explanation
Simpson’s Paradox refers specifically to cases where the sign of the correlation flips when comparing the entire dataset vs. specific strata, so only the first statement is correct. You have used 1 of 1 attempt
Ask your questions or make your comments about Correlation is Not Causation here! Remember, one of the best ways to reinforce your own learning is by explaining something to someone else, so we encourage you to answer each other’s questions (without giving away the answers, of course).
Some reminders:
Search the discussion board before posting to see if someone else has asked the same thing before asking a new question
Please be specific in the title and body of your post regarding which question you're asking about to facilitate answering your question.
Posting snippets of code is okay, but posting full code solutions is not.
If you do post snippets of code, please format it as code for readability. If you're not sure how to do this, there are instructions in a pinned post in the "general" discussion forum.
Discussion: Correlation is Not Causation Topic: Section 3: Confounding / 3.1: Correlation is Not Causation
UC Berkeley Admission admission 44% and 30% does not make sense
question posted about 4 hours ago by john_hhu2020
admissions %>% group_by(gender) %>% summarize(percentage = sum(admitted*applicants)/sum(applicants))
I think the way to calculate its should be 1>divide data into men and women groups, 2>for each group calculate % using sum of admitted divide by sum of applicants, 3>that will gives us 13% for women and 8% for men. I’m confusing on this for a while, please help
Assessment due Jun 26, 2022 00:29 AWST
For this set of exercises, we examine the data from a 2014 PNAS paper that analyzed success rates from funding agencies in the Netherlands and concluded: http://www.pnas.org/content/112/40/12349.abstract
“our results reveal gender bias favoring male applicants over female applicants in the prioritization of their”quality of researcher” (but not “quality of proposal”) evaluations and success rates, as well as in the language used in instructional and evaluation materials.”
A response was published a few months later titled No evidence that gender contributes to personal research funding success in The Netherlands: A reaction to Van der Lee and Ellemers, which concluded: http://www.pnas.org/content/112/51/E7036.extract
However, the overall gender effect borders on statistical significance, despite the large sample. Moreover, their conclusion could be a prime example of Simpson’s paradox; if a higher percentage of women apply for grants in more competitive scientific disciplines (i.e., with low application success rates for both men and women), then an analysis across all disciplines could incorrectly show “evidence” of gender inequality.
Who is right here: the original paper or the response? Here, you will examine the data and come to your own conclusion.
The main evidence for the conclusion of the original paper comes down to a comparison of the percentages. The information we need was originally in Table S1 in the paper, which we include in dslabs:
library(dslabs) data(“research_funding_rates”) research_funding_rates
library(dslabs)
data("research_funding_rates")
research_funding_rates
## discipline applications_total applications_men applications_women
## 1 Chemical sciences 122 83 39
## 2 Physical sciences 174 135 39
## 3 Physics 76 67 9
## 4 Humanities 396 230 166
## 5 Technical sciences 251 189 62
## 6 Interdisciplinary 183 105 78
## 7 Earth/life sciences 282 156 126
## 8 Social sciences 834 425 409
## 9 Medical sciences 505 245 260
## awards_total awards_men awards_women success_rates_total success_rates_men
## 1 32 22 10 26.2 26.5
## 2 35 26 9 20.1 19.3
## 3 20 18 2 26.3 26.9
## 4 65 33 32 16.4 14.3
## 5 43 30 13 17.1 15.9
## 6 29 12 17 15.8 11.4
## 7 56 38 18 19.9 24.4
## 8 112 65 47 13.4 15.3
## 9 75 46 29 14.9 18.8
## success_rates_women
## 1 25.6
## 2 23.1
## 3 22.2
## 4 19.3
## 5 21.0
## 6 21.8
## 7 14.3
## 8 11.5
## 9 11.2